{"title": "Approximating Hierarchical MV-sets for Hierarchical Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 999, "page_last": 1007, "abstract": "The goal of hierarchical clustering is to construct a cluster tree, which can be viewed as the modal structure of a density. For this purpose, we use a convex optimization program that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. We further extend existing graph-based methods to approximate the cluster tree of a distribution. By avoiding direct density estimation, our method is able to handle high-dimensional data more efficiently than existing density-based approaches. We present empirical results that demonstrate the superiority of our method over existing ones.", "full_text": "Approximating Hierarchical MV-sets for Hierarchical\n\nClustering\n\nAssaf Glazer\n\nOmer Weissbrod\n\nDepartment of Computer Science, Technion - Israel Institute of Technology\n\n{assafgr,omerw,mic,shaulm}@cs.technion.ac.il\n\nMichael Lindenbaum\n\nShaul Markovitch\n\nAbstract\n\nThe goal of hierarchical clustering is to construct a cluster tree, which can be\nviewed as the modal structure of a density. For this purpose, we use a convex op-\ntimization program that can ef\ufb01ciently estimate a family of hierarchical dense sets\nin high-dimensional distributions. We further extend existing graph-based meth-\nods to approximate the cluster tree of a distribution. By avoiding direct density\nestimation, our method is able to handle high-dimensional data more ef\ufb01ciently\nthan existing density-based approaches. We present empirical results that demon-\nstrate the superiority of our method over existing ones.\n\n1\n\nIntroduction\n\nData clustering is a classic unsupervised learning technique, whose goal is dividing input data into\ndisjoint sets. Standard clustering methods attempt to divide input data into discrete partitions. 
In hierarchical clustering, the goal is to find nested partitions of the data. The nested partitions reveal the modal structure of the data density, where clusters are associated with dense regions, separated by relatively sparse ones [27, 13].

Under the nonparametric assumption that the data is sampled i.i.d. from a continuous distribution F with Lebesgue density f in Rd, Hartigan observed that f has a hierarchical structure, called its cluster tree. Denote Lf(c) = {x : f(x) ≥ c} as the level set of f at level c. Then, the connected components in Lf(c) are the high-density clusters at level c, and the collection of all high-density clusters for c ≥ 0 has a hierarchical structure, where for any two clusters A and B, either A ⊆ B, B ⊆ A, or A ∩ B = ∅.

Figure 1: A univariate, tri-modal density function and its corresponding cluster tree are illustrated.

Figure 1 shows a plot of a univariate, tri-modal density function. The cluster tree of the density function is shown on top of the density function. The high-density clusters are nodes in the cluster tree. Leaves are associated with modes in the density function.

Given the density f, the cluster tree can be constructed in a straightforward manner via a recursive algorithm [23]. We start by setting the root node to a single cluster containing the entire space, corresponding to c = 0. We then recursively increase c until the number of connected components increases, at which point we define a new level of the tree. The process is repeated as long as the number of connected components keeps increasing. In Figure 1, for example, the root node has two daughter nodes, which were found at level c = 0.11. 
The next two descendants of the left node were found at level c = 0.23.

A common approach for hierarchical clustering is to first use a density estimation method to obtain f [18, 5, 23], and then estimate the cluster tree using the recursive method described above. However, one major drawback of this approach is that a reliable density estimate is hard to obtain, especially for high-dimensional data.

An alternative approach is to estimate the level sets directly, without a separate density estimation step. To do so, we define the minimum volume set (MV-set) at level α as the subset of the input space with the smallest volume and probability mass of at least α. MV-sets of a distribution, which are also level sets of the density f (under sufficient regularity conditions), are hierarchical by definition. The well-known One-Class SVM (OCSVM) [20] can efficiently find the MV-set at a specified level α. A naive approach for finding a hierarchy of MV-sets is to train distinct OCSVMs, one for each MV-set, and enforce hierarchy by intersection operations on the output. However, this solution is not well suited for finding a set of hierarchical MV-sets, because the natural hierarchy of MV-sets is not exploited, leading to a suboptimal solution.

In this study we propose a novel method for constructing cluster trees by directly estimating MV-sets, while guaranteeing convergence to a globally optimal solution. Our method utilizes the q-One-Class SVM (q-OCSVM) method [11], which can be regarded as a natural extension of the OCSVM, to jointly find the MV-sets at a set of levels {αi}. By avoiding direct density estimation, our method is able to handle high-dimensional data more efficiently than existing density-based approaches. 
By jointly considering the entire spectrum of desired levels, a globally optimal solution can be found. We combine this approach with a graph-based heuristic, found to be successful in high-dimensional data [2, 23], for finding high-density clusters in the approximated MV-sets. Briefly, we construct a fully connected graph whose nodes correspond to feature vectors, and remove edges between nodes separated by low-density regions. The connected components in the resulting graph correspond to high-density clusters.

The advantage of our method is demonstrated empirically on synthetic and real data, including a reconstruction of an evolutionary tree of human populations using the high-dimensional 1000 genomes dataset.

2 Background

Our novel method for hierarchical clustering belongs to the family of non-parametric clustering methods. Unlike parametric methods, which assume that each group i is associated with a density fi belonging to some family of parametric densities, non-parametric methods assume that each group is associated with modes of a density f [27]. Non-parametric methods aim to reveal the modal structure of f [13, 28, 14].

Hierarchical clustering methods can be divided into agglomerative (bottom-up) and divisive (top-down) methods. Agglomerative methods (e.g., single-linkage) start with n singleton clusters, one for each training feature vector, and work by iteratively linking the two closest clusters. Divisive methods, on the other hand, start with all feature vectors in a single cluster and recursively divide clusters into smaller sub-clusters.

While single-linkage was found, in theory, to have better stability and convergence properties than average-linkage and complete-linkage [4], it is frequently criticized by practitioners due to the chaining effect. 
Single-linkage ignores the density of feature vectors in clusters, and thus may erroneously connect two modes (clusters) joined by only a few feature vectors, that is, a "chain" of feature vectors.

Wishart [27] suggested overcoming this effect by conducting a one-level analysis of the data. The idea is to estimate a specific level set of the data density (Lf(c)), and to remove noisy features outside this level that could otherwise lead to the chaining effect. The connected components left in Lf(c) are the clusters; expansions of this idea can be found in [9, 26, 6, 3]. Indeed, this analysis is more resistant to the chaining effect. However, one of its major drawbacks is that no single level set can reveal all the modes of the density. Therefore, various studies have proposed estimating the entire hierarchical structure of the data (the cluster tree) using density estimates [13, 1, 22, 18, 5, 23, 17, 19]. These methods are considered divisive hierarchical clustering methods, as they start by associating all feature vectors with the root node, which is then recursively divided into sub-clusters by incrementally exploring level sets of denser regions. Our proposed method belongs to this group of divisive methods.

Stuetzle [22] used the nearest neighbor density estimate to construct the cluster tree and pointed out its connection to single-linkage clustering. Kernel density estimates were used in other studies [23, 19]. The bisecting K-means (BiKMean) method is another divisive method that was found to work effectively in cluster analysis [16], although it provides no theoretical guarantee for finding the correct cluster tree of the underlying density.

Hierarchical clustering methods can be used as an exploration tool for data understanding [16]. 
The nonparametric assumption, by which density modes correspond to homogeneous feature vectors with respect to their class labels, can be used to infer the hierarchical class structure of the data [15]. An implicit assumption is that the closer two feature vectors are, the less likely they are to have different class labels. Interestingly, this assumption, which does not necessarily hold for all distributions, has recently been discussed in the context of hierarchical sampling methods for active learning [8, 7, 25], where the correctness of such a hierarchical modeling approach is said to depend on the "Probabilistic Lipschitzness" assumption about the data distribution.

3 Approximating MV-sets for Hierarchical Clustering

Our proposed method consists of (a) estimating MV-sets using the q-OCSVM method; (b) using a graph-based method for finding a hierarchy of high-density regions in the MV-sets; and (c) constructing a cluster tree using these regions. These stages are described in detail below.

3.1 Estimating MV-Sets

We begin by briefly describing the One-Class SVM (OCSVM) method. Let X = {x1, . . . , xn} be a set of feature vectors sampled i.i.d. with respect to F. The function fC returned by the OCSVM algorithm is specified by the solution of this quadratic program:

min_{w ∈ F, ξ ∈ Rn, ρ ∈ R}  (1/2)||w||² − ρ + (1/(νn)) Σi ξi
s.t. (w · Φ(xi)) ≥ ρ − ξi,  ξi ≥ 0,        (1)

where ξ is a vector of slack variables. Recall that all training examples xi for which (w · Φ(xi)) − ρ ≤ 0 are called support vectors (SVs). Examples that strictly satisfy (w · Φ(xi)) − ρ < 0 are referred to as outliers. By solving the program for ν = 1 − α, we can use the OCSVM to approximate the MV-set C(α).

Let 0 < α1 < α2 < · · · < αq < 1 be a sequence of q quantiles. The q-OCSVM method generalizes the OCSVM algorithm to approximate a set of MV-sets {C1, . . . , Cq} such that the hierarchy constraint Ci ⊆ Cj is satisfied for i < j. Given X, the q-OCSVM algorithm solves this primal program:

min_{w, ξj, ρj}  (q/2)||w||² − Σ_{j=1..q} ρj + Σ_{j=1..q} (1/(νj n)) Σi ξj,i
s.t. (w · Φ(xi)) ≥ ρj − ξj,i,  ξj,i ≥ 0,  j ∈ [q],  i ∈ [n],        (2)

where νj = 1 − αj. This program generalizes Equation (1) to the case of finding multiple, parallel half-space decision functions by searching for a global minimum over the sum of their objective functions: the coupling between the q half-spaces is achieved by summing q OCSVM programs while forcing the programs to share the same w. As a result, the q half-spaces in the solution of Equation (2) differ only in their bias terms, and are thus parallel to each other. The program is convex, and thus a global minimum can be found in polynomial time.

Glazer et al. [11] prove that the q-OCSVM algorithm can be used to approximate the MV-sets of a distribution.

3.1.1 Generalizing q-OCSVM for Finding an Infinite Number of Approximated MV-sets

The q-OCSVM finds a finite number of q approximated MV-sets, which capture the overall structure of the cluster tree. However, in order to better resolve differences in density levels between data points, we would like to extend the solution to define an infinite number of hierarchical sets. Our approach for doing so relies on the parallelism property of the approximated MV-sets in the q-OCSVM solution. An infinite number of approximated MV-sets are associated with separating hyperplanes in F that are parallel to the q hyperplanes in the q-OCSVM solution. 
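For intuition, the primal program (2) can be prototyped in input space (i.e., a linear kernel Φ(x) = x) with a simple alternating scheme: subgradient steps on w, while each ρj is set to its separable optimum for fixed w, namely the νj-quantile of the scores w · xi. This is an illustrative sketch under these assumptions, not the solver used in [11]; the kernelized program can be handled by any convex QP solver. All function and variable names below are ours.

```python
import numpy as np

def q_ocsvm_linear(X, alphas, steps=500, lr=0.05):
    """Illustrative alternating scheme for the q-OCSVM primal, linear kernel.
    For fixed w, the optimal rho_j of program (2) is the nu_j-quantile of the
    scores w . x_i, with nu_j = 1 - alpha_j; between bias updates we take a
    decaying subgradient step on w."""
    n, d = X.shape
    nus = 1.0 - np.asarray(alphas)   # descending when alphas are ascending
    q = len(nus)
    w = np.ones(d)
    for t in range(steps):
        scores = X @ w
        rho = np.quantile(scores, nus)           # one bias per level
        g = q * w                                # gradient of (q/2)||w||^2
        for j in range(q):
            viol = scores < rho[j]               # margin violations at level j
            g -= X[viol].sum(axis=0) / (nus[j] * n)
        w -= (lr / np.sqrt(t + 1)) * g
    scores = X @ w
    return w, np.quantile(scores, nus)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 3.0              # toy data
w, rho = q_ocsvm_linear(X, alphas=[0.25, 0.5, 0.75])
# rho is non-increasing, so the half-spaces {x : w.x >= rho_j} are nested,
# and each level set holds roughly an alpha_j fraction of the sample.
masses = [(X @ w >= r).mean() for r in rho]
print(np.round(masses, 2))                       # roughly [0.25, 0.5, 0.75]
```

Because the quantile function is monotone and the νj are decreasing, the biases come out sorted (ρ1 ≥ ρ2 ≥ · · · ≥ ρq), which is exactly the hierarchy constraint Ci ⊆ Cj of the paper.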
Note that every projected feature vector Φ(x) lies on a unique separating hyperplane that is parallel to the q hyperplanes defined by the solution, and the distance dis(x) = (w · Φ(x)) − ρ is sufficient to determine whether x is located inside each of the approximated MV-sets.

We would like to know the probability mass associated with each of the infinitely many hyperplanes. For this purpose, we similarly estimate the expected probability mass of the approximated MV-set defined for any x ∈ Rd. When Φ(x) lies strictly on one of the i ∈ [q] hyperplanes, x is considered as lying on the boundary of the set approximating C(αi). When Φ(x) does not satisfy this condition, we use linear interpolation to define α for its corresponding approximated MV-set: let ρi, ρi+1 be the bias terms associated with the i-th and (i+1)-th approximated MV-sets such that ρi > (w · Φ(x)) > ρi+1. We then linearly interpolate (w · Φ(x)) along the [ρi+1, ρi] interval to obtain an intermediate α ∈ (αi, αi+1). To complete the definition, we set ρ0 = max_{x∈X}(w · Φ(x)) and ρq+1 = min_{x∈X}(w · Φ(x)).

3.2 Finding a Hierarchy of High-Density Regions

To find a hierarchy of high-density regions, we adopt a graph-based approach. We construct a fully-connected graph whose nodes correspond to feature vectors, and remove edges between nodes separated by low-density regions. The connected components in the resulting graph correspond to high-density regions. The method proceeds as follows.

Let α(x) be the expected probability mass of the approximated MV-set defined by x. 
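The interpolation defining α(x) can be sketched in a few lines; we assume the boundary hyperplanes ρ0 and ρq+1 are assigned masses 0 and 1 (the paper leaves these endpoint values implicit), and the function name is ours:

```python
import numpy as np

def alpha_of_score(score, rho, alphas):
    """Interpolated probability mass alpha(x) from the score (w . Phi(x)).
    `rho` holds [rho_0, rho_1, ..., rho_q, rho_{q+1}] in descending order and
    `alphas` the matching masses in ascending order; endpoint masses of 0 and
    1 are an assumption, not stated in the paper."""
    # np.interp needs increasing x-coordinates, so reverse both arrays.
    return float(np.interp(score, rho[::-1], alphas[::-1]))

# Toy setting with q = 3 quantiles at alpha = 0.25, 0.5, 0.75.
rho = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
alphas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

print(alpha_of_score(2.0, rho, alphas))   # exactly on the alpha_1 hyperplane -> 0.25
print(alpha_of_score(1.5, rho, alphas))   # halfway between rho_1 and rho_2 -> 0.375
```

Scores above ρ0 or below ρq+1 are clamped to the endpoint masses by np.interp, which matches the completion of the definition above.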
Let αi,s be the maximal value of α(x) over the line segment connecting the feature vectors xi and xs in X:

αi,s = max_{t∈[0,1]} α(t·xi + (1 − t)·xs).        (3)

Let G be the complete graph on the feature vectors in X, with edge weights αi,s.1 High-density clusters at level α are defined as the connected components in the graph G(α) induced by removing from G the edges with αi,s > α. This method guarantees that two feature vectors in the same cluster of the approximated MV-set at level α will surely lie in the same connected component of G(α). However, the opposite does not necessarily hold: when αi,s > α even though a curve connecting xi and xs exists within the cluster, xi and xs might erroneously be placed in different connected components. Nevertheless, it was empirically shown that erroneous splits of clusters are rare if the density function is smooth [23].

One way to implement this method for finding high-density clusters is to iteratively find connected components in G(α), where at each iteration α is incrementally increased (starting from α = 0), until all the clusters are found. However, [23] observed that this can be simplified by working only with the graph G and its minimal spanning tree T. Consequently, we can compute a hierarchy of high-density regions in two steps: first, construct G and its minimal spanning tree T; then, remove edges from T in descending order of their weights, such that the connected components left after removing an edge with weight α correspond to high-density clusters at level α. Connected components with a single feature vector are treated as outliers and removed.

1We calculated αi,s in G by checking the α(x) values for 20 points sampled from the line segment between xi and xs. 
The same approach was also used by [2] and [23].

3.3 Constructing a Cluster Tree

The hierarchy resulting from the procedure described above does not form a full partition of the data, since at each edge-removal step a fraction of the data is left outside the newly formed high-density clusters. To construct a full partition, feature vectors left outside at each step are assigned to their nearest cluster. Additionally, when a cluster is split into sub-clusters, each of its assigned feature vectors is assigned to one of the new sub-clusters.

The choice of kernel width has a strong effect on the resulting cluster tree. On the one hand, a large bandwidth may cause the inner products induced by the kernel function to be nearly constant; that is, many training examples are projected to the same point in F. Hence, the approximated MV-sets could eventually coincide, resulting in a cluster tree with a single node. On the other hand, a small bandwidth may drive the inner products toward zero; that is, points in F tend to lie on orthogonal axes, resulting in a cluster tree with many branches and leaves.

We believe that the best approach for choosing the bandwidth is based on the number of modes that we expect the density function to have. Using a grid search over possible γ values, we choose the bandwidth that results in a cluster tree whose number of modes matches this expectation.

4 Empirical Analysis

We evaluate our hierarchical clustering method on synthetic and real data. 
While the quality of an estimated cluster tree for the synthetic data can be evaluated by comparing the resulting tree with the true modal structure of the density, alternative quality measures are required to estimate the efficiency of hierarchical clustering methods on high-dimensional data when the density is unknown. In the following section we introduce our proposed measure.

4.1 The Quality Measure

One prominent measure is the F-measure, which was extended by [16] to evaluate the quality of estimated cluster trees. Recall that classes refer to the true (unobserved) class assignment of the observed vectors, whereas clusters refer to their tree-assigned partition. For a cluster j and class i, define ni,j as the number of feature vectors of class i in cluster j, and ni, nj as the number of feature vectors associated with class i and with cluster j, respectively. The F-measure for cluster j and class i is given by

Fi,j = 2 · Recalli,j · Precisioni,j / (Recalli,j + Precisioni,j),

where Recalli,j = ni,j/ni and Precisioni,j = ni,j/nj. The F-measure for the cluster tree is

F = Σi (ni/n) max_j {Fi,j}.        (4)

The F-measure was found to be a useful tool for the evaluation of hierarchical clustering methods [21], as it quantifies how well we can extract k clusters, one for each class, that are relatively "pure" and large enough with respect to their associated class. However, we found it difficult to use this measure directly in our analysis, because it appears to prefer overfitted trees with a large number of spurious clusters.

We suggest correcting this bias via cross-validation. We split the data X into two equal-sized train and test sets, and construct a tree using the train set. Test examples are recursively assigned to clusters in the tree in a top-down manner, and the F-measure is calculated according to the resulting tree. 
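The tree-level score in Equation (4) can be computed directly from per-class and per-cluster counts; a minimal sketch, with a function name of our choosing:

```python
import numpy as np

def cluster_tree_f_measure(classes, clusters):
    """F-measure of Equation (4): for each class i, take the best F_{i,j}
    over clusters j, then average weighted by the class frequency n_i / n."""
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    n = len(classes)
    total = 0.0
    for i in np.unique(classes):
        n_i = np.sum(classes == i)
        best = 0.0
        for j in np.unique(clusters):
            n_j = np.sum(clusters == j)
            n_ij = np.sum((classes == i) & (clusters == j))
            if n_ij == 0:
                continue
            recall, precision = n_ij / n_i, n_ij / n_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_i / n) * best
    return total

# Two classes, two clusters; cluster 0 absorbs one vector of class 1.
print(cluster_tree_f_measure([0, 0, 1, 1], [0, 0, 0, 1]))  # 11/15 = 0.7333...
```

In our setting the cluster labels come from the leaves (or internal nodes) of the estimated tree, and the max over j is what rewards overfitted trees: more clusters can only raise each class's best F, which is the bias the cross-validation above corrects.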
When analytical boundaries of clusters in the tree are not available (as in our method), we recursively assign each test example in a cluster to the sub-cluster containing its nearest neighbor in the train set, using Euclidean distance.

4.2 Reference Methods

We compare our method with density estimation methods that can also be used to construct a graph G. For this purpose, since f(x) is used instead of α(x), we had to adjust the way we construct G and T.2 A kernel density estimator (KDE) and a nearest neighbor density estimator (NNE), similar to the one used by [23], are used as competing methods. In addition, we compare our method with the bisecting K-means (BiKMean) method [21] for hierarchical clustering.

4.3 Experiments with Synthetic Data

We run our hierarchical clustering method on data sampled from a synthetic, two-dimensional, tri-modal distribution, defined as a mixture of 3 Gaussians. 20 i.i.d. points were sampled for training our q-OCSVM method, with α1 = 0.25, α2 = 0.5, α3 = 0.75 (3-quantiles), and with a bandwidth γ that results in a cluster tree with 3 modes. The left side of Figure 2 shows the sampled data and the 3 approximated hierarchical MV-sets. The resulting 3-mode cluster tree is shown on the right side of Figure 2.

Figure 2: Left: Data sampled for training our q-OCSVM method and the 3 approximated MV-sets; Right: The cluster tree estimated from the synthetic data. The most frequent label in each mode, denoted in curly brackets next to each leaf, defines the label of the mode. Branches are labeled with the probability mass associated with their level set.

We used our proposed and reference methods on the data to obtain cluster trees with different numbers of modes (leaves). 
The number of modes can be tweaked by changing the value of γ for the q-OCSVM and KDE methods, and by pruning nodes of small size for the NNE and BiKMean methods. 20 test examples were sampled i.i.d. from the same distribution to estimate the resulting F-measures. The left side of Figure 3 shows the F-measure for each method as a function of the number of modes in the resulting tree. For all methods, the F-measure is bounded by 0.8 as long as the number of modes is greater than 3, correctly suggesting the presence of 3 modes in the data.

4.4 The Olive Oil Dataset

The olive oil dataset [10] consists of 572 olive oil examples, with 8 features each, from 3 regions in Italy (R1, R2, R3), each one further divided into 3 sub-areas. The right side of Figure 3 shows the F-measure for each method as a function of the number of modes in the tree. The q-OCSVM method dominates the other three methods when the number of modes is higher than 5, with an average F = 0.62, while its best competitor (KDE) has an average F = 0.55.

The variability of the F-measure plots is higher for the q-OCSVM and KDE methods than for the BiKMean and NNE methods. This is a consequence of the fact that the structure of unpruned nodes remains the same for the BiKMean and NNE methods, whereas different γ values may lead to different tree structures for the q-OCSVM and KDE methods.

The cluster trees estimated using the q-OCSVM and KDE methods are shown in Figure 4. For each method, we chose to show the cluster tree with the smallest number of modes whose leaves correspond to all 8 labels. The q-OCSVM method groups leaves associated with the 8 areas into 3 clusters, which perfectly corresponds to the hierarchical structure of the labels. 
In contrast, modes estimated using the KDE method cannot be grouped into 3 homogeneous clusters.

2When a density estimator f is used, the edge weights are set to pi,s = min_{t∈[0,1]} f(t·xi + (1 − t)·xs), G(c) is induced by removing from G the edges with pi,s < c, and T is defined as the maximal spanning tree of G (instead of the minimal).

[Figure 2 plot annotations: Q=3, N=20, γ=15; Branch 4: {1 2}, P=0.68; Branch 5: {1 2 3}, P=0.85; Leaf 1: {1}; Leaf 2: {2}; Leaf 3: {3}]

Figure 3: Left: The F-measures of each method, plotted as a function of the number of modes in the estimated cluster trees and calculated using the synthetic test data; Right: The F-measure for the olive oil dataset, calculated using 286 test examples, shown as a function of the number of modes in the cluster tree.

Figure 4: Left: Cluster tree for the olive oil data estimated with q-OCSVM; Right: Cluster tree for the olive oil data estimated with KDE.

One prominent advantage of our method is that we can use the estimated probability mass of branches in the tree to better understand the modal structure of the data. For instance, we can learn from Figure 4 that the R2 cluster is found in a relatively sparse MV-set at level 0.89, while its two nodes are found in a much denser MV-set at level 0.12. Probability masses for high-density clusters can also be estimated using the KDE method, but unlike our method, without theoretical guarantees.

4.5 The 1000 Genomes Dataset

We have also evaluated our method on the 1000 genomes dataset [24]. Hierarchical clustering approaches arise naturally in genetic population studies, as they can reconstruct trees that describe evolutionary history and are often the first step in evolutionary studies [12]. 
The reconstruction of population structure is also crucial for genetic mapping studies, which search for genetic factors underlying genetic diseases.

In this experiment we evaluated our method's capability to reconstruct the evolutionary history of the populations represented in the 1000 genomes dataset, which consists of whole-genome sequences of 1,092 human individuals from 14 distinct populations. We used a trinary representation wherein each individual is represented as a vector of features taking the values 0, 1, or 2. Every feature represents a known genetic variation (with respect to the standard human reference genome3), where the number indicates the number of varied genome copies. We used data processed by the 1000 Genomes Consortium, which initially contained 2.25 million variations. To reduce dimensionality, we used the 1,000 features that had the highest information gain with respect to the populations. We excluded from the analysis highly genetically admixed populations (Colombian, Mexican, and Puerto Rican ancestry), because the evolutionary history of admixed populations cannot be represented by a tree. After exclusion, 911 individuals remained in the analysis.

3http://genomereference.org

Figure 5: Left: F-measure for the 1000 genomes dataset, calculated using 455 test examples; Right: Cluster tree for the 1000 genomes data estimated with q-OCSVM. 
The labels are GBR (British in England and Scotland), TSI (Toscani in Italia), CEU (Utah Residents with Northern and Western European ancestry), FIN (Finnish in Finland), CHB (Han Chinese in Beijing, China), CHS (Southern Han Chinese), ASW (Americans of African Ancestry in SW USA), YRI (Yoruba in Ibadan, Nigeria), and LWK (Luhya in Webuye, Kenya).

The left side of Figure 5 shows that q-OCSVM dominates the other methods for every number of modes tested, demonstrating its superiority in high-dimensional settings: it achieves an F-measure of 0.4 for more than 2 modes, whereas the competing methods obtain an F-measure of 0.35. KDE was not evaluated, as it is not applicable due to the high data dimensionality.

To obtain a meaningful tree, we increased the number of modes until leaves corresponding to all three major human population groups represented in the dataset (African, East Asian, and European) appeared. The tree obtained using 28 modes is shown on the right side of Figure 5, indicating that q-OCSVM clustering successfully distinguishes between these three population groups. Additionally, it corresponds with the well-established theory that a divergence of a single ancestral population into African and Eurasian populations took place in the distant past, and that Eurasians diverged into East Asian and European populations at a later time [12]. The larger number of leaves representing European populations may result from the larger number of European individuals and populations in the 1000 genomes dataset.

5 Discussion

In this research we use the q-OCSVM method as a plug-in method for hierarchical clustering in high-dimensional distributions. The q-OCSVM method estimates the level sets (MV-sets) directly, without a density estimation step. Therefore, we expect to achieve more accurate results than approaches based on density estimation. 
Furthermore, since we know α for each approximated MV-set, we believe our solution is more interpretable and informative than one provided by a density estimation-based method.

References

[1] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49–60, 1999.

[2] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support vector clustering. The Journal of Machine Learning Research, 2:125–137, 2002.

[3] Gérard Biau, Benoît Cadre, and Bruno Pelletier. A graph-based estimator of the number of clusters. ESAIM: Probability and Statistics, 11(1):272–280, 2007.

[4] Gunnar Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods. The Journal of Machine Learning Research, 99:1425–1470, 2010.

[5] Gunnar Carlsson and Facundo Mémoli. Multiparameter hierarchical clustering methods. In Classification as a Tool for Research, pages 63–70. Springer, 2010.

[6] Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis, 36(4):441–459, 2001.

[7] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.

[8] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, pages 208–215. ACM, 2008.

[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. 
In KDD, volume 96, pages 226–231, 1996.

[10] M. Forina, C. Armanino, S. Lanteri, and E. Tiscornia. Classification of olive oils from their fatty acid composition. Food Research and Data Analysis, pages 189–214, 1983.

[11] Assaf Glazer, Michael Lindenbaum, and Shaul Markovitch. q-OCSVM: A q-quantile estimator for high-dimensional distributions. In Advances in Neural Information Processing Systems, pages 503–511, 2013.

[12] I. Gronau, M. J. Hubisz, et al. Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031–1034, Oct 2011.

[13] John A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, 1975.

[14] Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[15] Daphne Koller and Mehran Sahami. Hierarchically classifying documents using very few words. In ICML, pages 170–178. Morgan Kaufmann Publishers Inc., 1997.

[16] Bjornar Larsen and Chinatsu Aone. Fast and effective text mining using linear-time document clustering. In SIGKDD, ACM, pages 16–22, 1999.

[17] Álvaro Martínez-Pérez. A density-sensitive hierarchical clustering method. arXiv preprint arXiv:1210.6292, 2012.

[18] Philippe Rigollet and Régis Vert. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.

[19] Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, and Larry Wasserman. Stability of density-based clustering. Journal of Machine Learning Research, 13:905–948, 2012.

[20] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[21] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. 
In KDD Workshop on Text Mining, 2000.

[22] Werner Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(1):25–47, 2003.

[23] Werner Stuetzle and Rebecca Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2), 2010.

[24] The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 491:1, 2012.

[25] Ruth Urner, Sharon Wulff, and Shai Ben-David. PLAL: Cluster-based active learning. In COLT, pages 1–22, 2013.

[26] G. Walther. Granulometric smoothing. The Annals of Statistics, pages 2273–2299, 1997.

[27] David Wishart. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. Numerical Taxonomy, 76:282–311, 1969.

[28] Rui Xu, Donald Wunsch, et al. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.
", "award": [], "sourceid": 611, "authors": [{"given_name": "Assaf", "family_name": "Glazer", "institution": "Technion"}, {"given_name": "Omer", "family_name": "Weissbrod", "institution": "Technion"}, {"given_name": "Michael", "family_name": "Lindenbaum", "institution": "Technion"}, {"given_name": "Shaul", "family_name": "Markovitch", "institution": "Technion"}]}