{"title": "Mode Estimation for High Dimensional Discrete Tree Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1323, "page_last": 1331, "abstract": "This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading $(\\delta,\\rho)$-modes of the underlying distributions. A point is defined to be a $(\\delta,\\rho)$-mode if it is a local optimum of the density within a $\\delta$-neighborhood under metric $\\rho$. As we increase the ``scale'' parameter $\\delta$, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the $(\\delta,\\rho)$-modes reveal intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions.", "full_text": "Mode Estimation for High Dimensional Discrete Tree\n\nGraphical Models\n\nChao Chen\n\nDepartment of Computer Science\n\nRutgers, The State University of New Jersey\n\nPiscataway, NJ 08854-8019\n\nchao.chen.cchen@gmail.com\n\nHan Liu\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University, Princeton, NJ 08544\n\nhanliu@princeton.edu\n\nDimitris N. 
Metaxas\n\nDepartment of Computer Science\n\nRutgers, The State University of New Jersey\n\nPiscataway, NJ 08854-8019\ndnm@cs.rutgers.edu\n\nTianqi Zhao\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University, Princeton, NJ 08544\n\ntianqi@princeton.edu\n\nAbstract\n\nThis paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading (\u03b4, \u03c1)-modes of the underlying distribution. A point is defined to be a (\u03b4, \u03c1)-mode if it is a local optimum of the density within a \u03b4-neighborhood under metric \u03c1. As we increase the \u201cscale\u201d parameter \u03b4, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the (\u03b4, \u03c1)-modes reveals intrinsic topographical information of the underlying distribution. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions.\n\n1 Introduction\n\nBig Data challenges modern data analysis with large dimensionality, insufficient samples, and inhomogeneity. To handle these challenges, new methods for visualizing and exploring complex datasets are crucially needed. In this paper, we develop a new method for computing diverse modes of an unknown discrete distribution. Our method is applicable in many fields, such as computational biology, computer vision, etc. More specifically, our method aims to find a sequence of (\u03b4, \u03c1)-modes, which are defined as follows:\nDefinition 1 ((\u03b4, \u03c1)-modes). 
A point is a (\u03b4, \u03c1)-mode if and only if its probability is higher than that of all other points within distance \u03b4 under a distance metric \u03c1.\nWith a metric \u03c1(\u00b7) given, the \u03b4-neighborhood of a point x, N\u03b4(x), is defined as the ball centered at x with radius \u03b4. Varying \u03b4 from small to large, we can examine the topology of the underlying distribution at different scales. Therefore \u03b4 is also called the scale parameter. When \u03b4 = 0, N\u03b4(x) = {x}, so every point is a mode. When \u03b4 = \u221e, N\u03b4(x) is the whole domain, denoted by X, so the maximum a posteriori labeling is the only mode. As \u03b4 increases from zero to infinity, the \u03b4-neighborhood of x monotonically grows and the set of modes, denoted by M\u03b4, monotonically shrinks. Therefore, as \u03b4 increases, the sets M\u03b4 form a nested sequence, which can be viewed as a multi-scale description of the underlying probability landscape. See Figure 1 for an illustrative example. In this paper, we will use the Hamming distance, \u03c1H, i.e., the number of variables at which two points disagree. Other distance metrics, e.g., the L2 distance \u03c1L2(x, x\u2032) = \u2016x \u2212 x\u2032\u20162, are also possible but with more computational challenges.\n\nThe concept of modes can be justified by many practical problems. We mention the following two motivating applications: (1) Data analysis: modes at multiple scales provide a comprehensive geometric description of the topography of the underlying distribution. In the low-dimensional continuous domain, such tools have been proposed and used for statistical data analysis [20, 17, 3]. One of our goals is to carry these tools to the discrete and high dimensional setting. (2) Multiple predictions: in applications such as computational biology [9] and computer vision [2, 6], instead of one, a model generates multiple predictions. 
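To make Definition 1 and the nestedness of the mode sets concrete, here is a brute-force sketch under the Hamming metric. It is feasible only for tiny domains, and all probabilities below are invented for illustration:

```python
from itertools import product

def hamming(x, y):
    # Number of variables at which two labelings disagree (the metric rho_H).
    return sum(a != b for a, b in zip(x, y))

def modes(p, delta):
    # All (delta, Hamming)-modes of a distribution p over tuples:
    # x is a mode iff p(x) > p(y) for every other y within distance delta.
    pts = list(p)
    return {x for x in pts
            if all(p[x] > p[y]
                   for y in pts if y != x and hamming(x, y) <= delta)}

# Toy distribution over {0,1}^3; the weights are made up for illustration.
support = list(product([0, 1], repeat=3))
weights = [5, 1, 1, 4, 1, 1, 3, 1]
Z = sum(weights)
p = {x: w / Z for x, w in zip(support, weights)}

M = {d: modes(p, d) for d in range(4)}
# The mode sets are nested and shrink to the single global maximum:
assert M[0] >= M[1] >= M[2] >= M[3]
assert M[3] == {(0, 0, 0)}
```

At delta = 1 this toy landscape has three modes, at delta = 2 only the global maximum survives, mirroring the multi-scale picture of Figure 1.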
These predictions are expected to have not only high probability but also high diversity. These solutions are valid hypotheses which could be useful in other modules further down the pipeline. In this paper we address the computation of modes; formally,\nProblem 1 (M-modes). For all \u03b4\u2019s, compute the M modes with the highest probabilities in M\u03b4.\n\nThis problem is challenging. In the continuous setting, one often starts from random positions, estimates the gradient of the distribution and walks along it towards a nearby mode [8]. However, this gradient-ascent approach is limited to low-dimensional distributions over continuous domains. In discrete domains, gradients are not defined. Moreover, a naive exhaustive search is computationally infeasible as the total number of points is exponential in the dimension. In fact, even deciding whether a given point is a mode is expensive, as the neighborhood has exponential size.\n\nIn this paper, we propose a new approach to compute these discrete (\u03b4, \u03c1)-modes. We show that the problem becomes computationally tractable when we restrict to distributions with tree factor structures. We explore the structure of tree graphs and devise a new algorithm to compute the top M modes of a tree-structured graphical model. Inspired by the observation that a global mode is also a mode within smaller subgraphs, we show that all global modes can be discovered by examining all local modes and their consistent combinations. Our algorithm first computes local modes, and then computes the high probability combinations of these local modes using a junction tree approach. We emphasize that the algorithm itself can be used in many graphical model based methods, such as conditional random fields [10], structured SVMs [22], etc.\n\nWhen the distribution is not expressed as a factor graph, we will first estimate the tree-structured factor graph using the algorithm of Liu et al. [13]. 
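A minimal sketch of this tree-estimation step in the spirit of the Chow-Liu construction detailed in Sec. 2: a maximum spanning tree under empirical mutual-information edge weights. The sample data here are invented for illustration:

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(samples, i, j):
    # Empirical mutual information I_ij between columns i and j.
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    return sum((c / n) * log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def estimate_tree(samples, D):
    # Kruskal's maximum spanning tree over mutual-information weights.
    parent = list(range(D))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = sorted(combinations(range(D), 2),
                   key=lambda e: -mutual_information(samples, *e))
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Toy data: column 1 copies column 0, column 2 is independent (made up).
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 25
tree = estimate_tree(samples, 3)
assert (0, 1) in tree  # the strongly dependent pair is selected first
```

On these samples the (0, 1) pair has mutual information log 2 while the other pairs have none, so Kruskal's algorithm picks that edge first, exactly as the weighting in Sec. 2 prescribes.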
Experimental results demonstrate the accuracy and efficiency of our algorithm. More theoretical guarantees for our algorithm can be found in [7].\n\nRelated work. Modes of distributions have been studied in continuous settings. Silverman [21] devised a test of the null hypothesis that a kernel density estimate has at most a certain number of modes. Modes can be used in clustering [8, 11]: for each data point, a monotonically increasing path is computed using a gradient-ascent method, and all data points whose gradient paths converge to the same mode are assigned to the same cluster. Modes can also be used to help decide the number of mixture components in a mixture model, for example as the initialization of maximum likelihood estimation [11, 15]. The topographical landscape of distributions has been studied and used to characterize topological properties of data [4, 20, 17]. Most of these approaches assume a kernel density estimation model, and modes are detected by approximating the gradient using k-nearest neighbors. This approach is known to be inaccurate for high dimensional data.\n\nWe emphasize that the multi-scale view of a function has been used broadly in computer vision. By convolving an image with Gaussian kernels of different widths, we obtain different levels of detail. This theory, called scale-space theory [25, 12], is the fundamental principle behind most state-of-the-art image feature extraction techniques [14, 16]. This multi-scale view has been used in statistical data analysis by Chaudhuri and Marron [3]. Chen and Edelsbrunner [5] quantitatively measured the topographical landscape of an image at different scales.\n\nChen et al. [6] proposed a method to compute modes of a simple chain model. However, restricting to a simple chain limits mode prediction accuracy: a simple chain model has much less flexibility than tree-factored models. 
Even if the distribution has a chain structure, recovering the chain from data is computationally intractable: the problem requires finding the chain with maximal total mutual information, and thus is equivalent to the NP-hard travelling salesman problem.\n\nFigure 1: An illustration of modes of different scales. Each vertical bar corresponds to an element. The height corresponds to its probability. Left: when \u03b4 = 1, there are three modes (red). Middle: when \u03b4 = 4, only two modes are left. Right: the multi-scale view of the landscape.\n\n2 Background\nGraphical models. We briefly introduce graphical models. Please refer to [23, 19] for more details. The graphical model is a powerful tool to model the joint distribution of a set of interdependent random variables. The distribution is encoded in a graph G = (V, E) and a potential function f. The set of vertices/nodes V corresponds to the set of discrete variables i \u2208 [1, D], where D = |V|. A node i can be assigned a label xi \u2208 L. A label configuration of all variables x = (x1, . . . , xD) is called a labeling. We denote by X = LD the domain of all labelings. The potential function f : X \u2192 R assigns to each labeling a real value, which is inversely proportional to the logarithm of the probability: p(x) = exp(\u2212f (x) \u2212 A), where A = log \u2211_{x\u2208X} exp(\u2212f (x)) is the log-partition function. Thus the maximal modes of the distribution and the minimal modes of f have a one-to-one correspondence. Assuming these variables satisfy the Markov properties, the potential function can be written as\n\nf (x) = \u2211_{(i,j)\u2208E} fi,j(xi, xj),\n\n(2.1)\n\nwhere fi,j : L \u00d7 L \u2192 R is the potential function for edge (i, j).1 For convenience, we assume any two different labelings have different potential function values.\nWe define the following notations for convenience. A vertex subset, V\u2032 \u2286 V, induces a subgraph consisting of V\u2032 together with all edges whose ends are both within V\u2032. In this paper, all subgraphs are vertex-induced. Therefore, we abuse notation and denote both the subgraph and the vertex subset by the same symbol.\n\nWe call a labeling of a subgraph B a partial labeling. For a given labeling y, we denote by yB its label configuration over the vertices of B. The distance between two partial labelings xB and yB\u2032 is defined as the Hamming distance between the two within the intersection of the two subgraphs \u02c6B = B \u2229 B\u2032; formally, \u03c1(xB, yB\u2032) = \u03c1(x \u02c6B, y \u02c6B). We denote by fB(yB) the potential of the partial labeling, which is evaluated only over edges within B. When the context is clear, we drop the subscript B and write f (yB).\n\nTree density estimation. In this paper, we focus on tree-structured graphical models. A distribution that is Markov to a tree structure has the following factorization:\n\nP (X = x) = p(x) = \u220f_{(i,j)\u2208E} [p(xi, xj) / (p(xi)p(xj))] \u220f_{k\u2208V} p(xk).\n\n(2.2)\n\nIt is easy to see that the potential function can be written in the form (2.1). When the input is a set of samples, we will first use the tree density estimation algorithm [13] to estimate the graphical model. The oracle tree distribution is the one, in the space of all tree distributions, that minimizes the Kullback-Leibler (KL) divergence to the true density, that is, q\u2217 = argmin_{q\u2208PT} D(p\u2217||q), where PT is the family of distributions supported on a tree graph, p\u2217 is the true density, and D(p||q) = \u2211_{x\u2208X} p(x)(log p(x) \u2212 log q(x)) is the KL divergence. It is proved in [1] that q\u2217 has the same univariate and bivariate marginal distributions as p\u2217. Hence to recover q\u2217, we only need to recover the structure of the tree. Denote by E\u2217 the edge set of the oracle tree. A simple calculation shows that D(p\u2217||q\u2217) = \u2212\u2211_{(i,j)\u2208E\u2217} Iij + const, where\n\nIij = \u2211_{xi=1}^{L} \u2211_{xj=1}^{L} p\u2217(xi, xj)(log p\u2217(xi, xj) \u2212 log p\u2217(xi) \u2212 log p\u2217(xj))\n\n(2.3)\n\nis the mutual information between nodes i and j. Therefore we can apply Kruskal\u2019s maximum spanning tree algorithm to obtain E\u2217, with edge weights being the mutual information.\n\nIn reality, we do not know the true univariate and bivariate marginal distributions. We thus compute estimators \u02c6Iij from the data set {X(1), . . . , X(n)} by replacing p\u2217(xi, xj) and p\u2217(xi) in (2.3) with their estimates \u02c6p(xi, xj) = (1/n) \u2211_{s=1}^{n} 1{X(s)_i = xi, X(s)_j = xj} and \u02c6p(xi) = (1/n) \u2211_{s=1}^{n} 1{X(s)_i = xi}. The tree estimator is thus obtained by Kruskal\u2019s algorithm:\n\n\u02c6Tn = argmax_T \u2211_{(i,j)\u2208E(T)} \u02c6Iij.\n\n(2.4)\n\nBy definition, the potential function on each edge can be estimated similarly using the estimated univariate and bivariate marginal distributions. By (2.1), we have \u02c6f (x) = \u2211_{(i,j)\u2208E(\u02c6T)} \u02c6fi,j(xi, xj), where \u02c6T is the estimated tree obtained by Kruskal\u2019s algorithm.\n\n1For convenience, we drop unary potentials fi in this paper. Note that any potential function with unary potentials can be rewritten as a potential function without them.\n\nFigure 2: Left: The junction tree with radius r = 2. We show the geodesic balls of three supernodes. In each geodesic ball, the center is red. The boundary vertices are blue. 
The interior vertices are black and red. Right-bottom: Candidates of a geodesic ball. Each column corresponds to the candidates of one boundary labeling. Solid and empty vertices represent labels zero and one. Right-top: A geodesic ball with radius r = 3.\n3 Method\nWe present the first algorithm to compute M\u03b4 for a tree-structured graph. To compute modes of all scales, we go through the \u03b4\u2019s from small to large. The iteration stops at a \u03b4 with only a single mode.\n\nWe first present a polynomial algorithm for the verification problem: deciding whether a given labeling is a mode (Sec. 3.1). However, this algorithm is insufficient for computing the top M modes because the space of labelings has exponential size. To compute global modes, we decompose the problem into computing modes of smaller subgraphs, which are called local modes. Because of the bounded subgraph size, local modes can be computed efficiently. In Sec. 3.2, we study the relationship between global and local modes. In Sec. 3.3 and Sec. 3.4, we give two different methods to compute local modes, depending on different situations.\n3.1 Verifying whether a labeling is a mode\nTo verify whether a given labeling y is a mode, we check whether there is another labeling within N\u03b4(y) with a smaller potential. We compute the labeling within the neighborhood with the minimal potential, y\u2217 = argmin_{z\u2208N\u03b4(y)} f (z). The given labeling y is a mode if and only if y\u2217 = y.\n\nWe present a message-passing algorithm. We select an arbitrary node as the root, which induces a child-parent relationship between any two adjacent nodes. We compute messages from the leaves to the root. Denote by Tj the subtree rooted at node j. The message from vertex i to j, MSGi\u2192j(\u2113i, \u03c4), is the minimal potential one can achieve within the subtree Ti given a fixed label \u2113i at i and the constraint that the partial labeling of the subtree is no more than \u03c4 away from y. Formally,\n\nMSGi\u2192j(\u2113i, \u03c4) = min_{zTi : zi=\u2113i, \u03c1(zTi, y)\u2264\u03c4} f (zTi),\n\nwhere \u2113i \u2208 L and \u03c4 \u2208 [0, \u03b4]. This message cannot be computed until the messages from all children of i have been computed. For ease of exposition, we add a pseudo vertex s as the parent of the root, r. By definition, min_{\u2113r} MSGr\u2192s(\u2113r, \u03b4) is the potential of the desired labeling, y\u2217. Using the standard backtracking strategy of message passing, we can recover y\u2217. Please refer to [7] for details of the computation of each individual message. For convenience we call this procedure Is-a-Mode. This procedure and its variations will be used later.\n3.2 Local and global modes\nGiven a graph G and a collection of its subgraphs B, we show that under certain conditions, there is a tight connection between the modes of these subgraphs and the modes of G. In particular, any consistent combination of these local modes is a global mode, and vice versa.\n\nSimply considering the modes of a subgraph B is insufficient. A mode of B with small potential may incur a big penalty when it is extended to a labeling of the whole graph. Therefore, when defining a local mode, we select a boundary region of the subgraph and consider all possible label configurations of this boundary region. Formally, we divide the vertex set of B into two disjoint subsets, the boundary \u2202B and the interior int(B), so that any path connecting an interior vertex u \u2208 int(B) and an outside vertex v \u2209 B has to pass through at least one boundary vertex w \u2208 \u2202B. See Figure 2(left) for examples of B. 
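The semantics of the Is-a-Mode check of Sec. 3.1 can be pinned down by a brute-force reference implementation that enumerates the entire neighborhood, exponential in D and therefore useful only as a sanity check on tiny models. The chain potentials below are invented for illustration:

```python
from itertools import product

def is_a_mode(f, y, delta, labels, D):
    # y is a mode iff y = argmin of f over its delta-neighborhood
    # under Hamming distance (brute force over all L^D labelings).
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))
    neigh = [z for z in product(labels, repeat=D) if hamming(z, y) <= delta]
    return y == min(neigh, key=f)

# Potential of a 3-node chain 0-1-2 with pairwise terms only, as in
# Eq. (2.1); edge potentials reward equal labels (numbers made up).
def f(x):
    penalty = {(0, 0): 0.0, (1, 1): 0.1, (0, 1): 1.0, (1, 0): 1.0}
    return penalty[x[0], x[1]] + penalty[x[1], x[2]]

assert is_a_mode(f, (0, 0, 0), 1, (0, 1), 3)      # global minimum of f
assert is_a_mode(f, (1, 1, 1), 1, (0, 1), 3)      # a mode at delta = 1
assert not is_a_mode(f, (1, 1, 1), 3, (0, 1), 3)  # beaten at a larger scale
```

Note that modes here minimize the potential f, which corresponds to maximizing probability, and the check relies on the paper's assumption that distinct labelings have distinct potentials.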
Similar to the definition of a global mode, we define a local mode as a partial labeling with the smallest potential in its \u03b4-neighborhood:\nDefinition 2 (local modes). A partial labeling, xB, is a local mode w.r.t. the \u03b4-neighborhood if and only if there is no other partial labeling yB which (C1) has a smaller potential, f (yB) < f (xB); (C2) is within \u03b4 distance from xB, \u03c1(yB, xB) \u2264 \u03b4; and (C3) has the same boundary labeling, y\u2202B = x\u2202B.\n\nWe denote by M\u03b4_B the space of local modes of the subgraph B. Given a set of subgraphs B together with an interior-boundary decomposition for each subgraph, we have the following theorem.\nTheorem 3.1 (local-global). Suppose any connected subgraph G\u2032 \u2286 G of size \u03b4 is contained within int(B) of some B \u2208 B. A labeling x of G is a global mode if and only if for every B \u2208 B, the corresponding partial labeling xB is a local mode.\nProof. The necessity is obvious since a global mode is a local mode within every subgraph. Note that necessity no longer holds if the restriction on \u2202B (C3 in Definition 2) is relaxed. Next we show the sufficiency by contradiction. Suppose a labeling x is a local mode within every subgraph, but is not a global mode. By definition, there is y \u2208 N\u03b4(x) with smaller potential than x. We may assume y and x disagree within a single connected subgraph: if y and x disagree within multiple connected components, we can always find y\u2032 \u2208 N\u03b4(x) with smaller potential which disagrees with x within only one of these components. The subgraph on which x and y disagree must be contained in the interior of some B \u2208 B. Thus xB is not a local mode due to the existence of yB. Contradiction.\n\nWe say partial labelings of two different subgraphs are consistent if they agree at all common vertices. 
Theorem 3.1 shows that there is a bijection between the set of global modes and the set of consistent combinations of local modes. This enables us to compute global modes by first computing the local modes of each subgraph and then searching through all their consistent combinations.\nInstantiating for a tree-structured graph. For a tree-structured graph with D nodes, let B be the set of D geodesic balls centered at the D nodes. Each ball has radius r = \u230a\u03b4/2\u230b + 1. Formally, we have Bi = {j | dist(i, j) \u2264 r}, \u2202Bi = {j | dist(i, j) = r}, and int(Bi) = {j | dist(i, j) < r}. Here dist(i, j) is the number of edges between the two nodes. See Figure 2(left) for examples. It is not hard to see that any size-\u03b4 subtree is contained within int(Bi) for some i. Therefore, the prerequisite of Theorem 3.1 is guaranteed.\n\nWe construct a junction tree to combine the set of all consistent local modes. It is constructed as follows: Each supernode of the junction tree corresponds to a geodesic ball. Two supernodes are neighbors if and only if their centers are neighbors in the original tree. See Figure 2(left). Let the label set of a supernode be its corresponding local modes, as defined in Definition 2. We construct a potential function of the junction tree so that a labeling of the junction tree has finite potential if and only if the corresponding local modes are consistent. Furthermore, whenever the potential of a junction tree labeling is finite, it is equal to the potential of the corresponding labeling in the original graph. This construction can be achieved using a standard junction tree construction algorithm, as long as the local mode set of each ball is given.\n\nThe M-modes problem is then reduced to computing the M lowest-potential labelings of the junction tree. This is the M-best labeling problem and can be solved efficiently using Nilsson\u2019s algorithm [18]. 
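The geodesic-ball decomposition above can be sketched with a breadth-first search. The tree is given as an adjacency list, and the small path graph used as input is made up for illustration:

```python
from collections import deque

def geodesic_ball(adj, center, delta):
    # B, boundary(B), int(B) for the ball of radius r = floor(delta/2) + 1
    # around `center` in a tree given as an adjacency list (Sec. 3.2).
    r = delta // 2 + 1
    dist = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist and dist[u] < r:
                dist[v] = dist[u] + 1
                q.append(v)
    ball = set(dist)
    boundary = {v for v, d in dist.items() if d == r}
    interior = ball - boundary
    return ball, boundary, interior

# A path 0-1-2-3-4 as a toy tree (example made up).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
B, bd, inte = geodesic_ball(adj, 2, delta=1)   # r = 1
assert B == {1, 2, 3} and bd == {1, 3} and inte == {2}
```

Computing one such decomposition per node yields the D balls whose local modes label the supernodes of the junction tree.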
The algorithm of this section is summarized in Procedure Compute-M-Modes.\nProcedure 1 Compute-M-Modes\nInput: A tree G, a potential function f and a scale \u03b4\nOutput: The M modes of lowest potential\n1: Construct geodesic balls B = {Br(c) | c \u2208 V}, where r = \u230a\u03b4/2\u230b + 1\n2: for all B \u2208 B do\n3: M\u03b4_B = the set of local modes of B\n4: Construct a junction tree (Figure 2). The label set of each supernode is its local modes.\n5: Compute the M lowest-potential labelings of the junction tree, using Nilsson\u2019s algorithm.\n\n3.3 Computing local modes via enumeration\nIt remains to compute all local modes of each geodesic ball B. We give two different algorithms, in Sec. 3.3 and Sec. 3.4. Both methods have two steps. First, compute a set of candidate partial labelings. Second, choose from these candidates the ones that satisfy Definition 2. In both methods, it is essential to ensure that the candidate set contains all local modes.\n\nComputing a candidate set. The first method enumerates all possible labelings of the boundary. For each boundary labeling x\u2202B, we compute a corresponding subset of candidates. Each candidate is the partial labeling of minimal potential with boundary labeling x\u2202B and a fixed label \u2113 at the center c. This subset has L elements since c has L labels. Formally, the candidate subset for a fixed boundary labeling x\u2202B is CB(x\u2202B) = {argmin_{yB} fB(yB) | y\u2202B = x\u2202B, yc \u2208 L}. It can be computed using a standard message-passing algorithm over the tree, using c as the root.\nDenote by XB and X\u2202B the spaces of all partial labelings of B and \u2202B respectively. The candidate set we compute is the union of the candidate subsets over all boundary labelings, i.e., CB = \u222a_{x\u2202B\u2208X\u2202B} CB(x\u2202B). See Figure 2(right-bottom) for an example candidate set. We can show that the computed candidate set CB contains all local modes of B.\nTheorem 3.2. Any local mode yB belongs to the candidate set CB.\nBefore proving the theorem, we formalize an assumption on the geodesic balls.\n\nAssumption 1 (well-centered). We assume that after removing the center from int(B), each connected component of the remaining graph has size smaller than \u03b4.\n\nFor example, in Figure 2(right-top), a geodesic ball of radius 3 has three connected components in int(B)\\{c}, of sizes one, two and three, respectively. Since r = \u230a\u03b4/2\u230b + 1, \u03b4 is either four or five. The ball is well-centered. Since the interior of B is essentially a ball of radius r \u2212 1 = \u230a\u03b4/2\u230b, the assumption is unlikely to be violated, as we observed in practice. In the worst case when the assumption is violated, we can still solve the problem by adding additional centers in the middle of these connected components. Next we prove the theorem.\nProof of Theorem 3.2. We prove by contradiction. Suppose there is a local mode yB /\u2208 CB(x\u2202B) such that y\u2202B = x\u2202B. Let \u2113 be the label of yB at the center c. Let y\u2032B \u2208 CB(x\u2202B) be the candidate with the same label at the center. The two partial labelings agree at \u2202B and at c. Therefore the two labelings differ on a set of connected subgraphs, each of size smaller than \u03b4 due to Assumption 1. Since y\u2032B has a smaller potential than yB by definition, we can find a partial labeling y\u2033B which disagrees with yB within only one of these components and has a smaller potential than yB. Therefore yB cannot be a local mode. Contradiction.\nVerifying each candidate. 
Next, we show how to check whether a candidate is a local mode. For a given boundary labeling, x\u2202B, we denote by XB(x\u2202B) the space of all partial labelings with fixed boundary labeling x\u2202B. By definition, a candidate yB \u2208 XB(x\u2202B) is a local mode if and only if there is no other partial labeling in XB(x\u2202B) within \u03b4 of yB with a smaller potential. The verification of yB can be transformed into a global mode verification problem and solved by the algorithm in Sec. 3.1. We use the subgraph B and its potential to construct a new graph. We need to ensure that only labelings with the fixed boundary labeling x\u2202B are considered in this new graph. This can be done by enforcing each boundary node i \u2208 \u2202B to have xi as its only feasible label.\n3.4 Computing local modes using local modes of smaller scales\nIn Sec. 3.3, we computed the candidate set by enumerating all boundary labelings x\u2202B. In this subsection, we present an alternative method for when the local modes of scale \u03b4 \u2212 1 have been computed. We construct a new candidate set using the local modes of scale \u03b4 \u2212 1. This candidate set is smaller than the candidate set from the previous subsection and thus leads to a more efficient algorithm. Since our algorithm computes modes from small scales to large scales, this method can be used at all scales except \u03b4 = 1. The step of verifying whether each candidate is a local mode is the same as in the previous subsection.\nThe following notations will prove convenient. Denote by r and r\u2032 the radii of balls for scales \u03b4 and \u03b4 \u2212 1 respectively (see Sec. 3.2 for the definition). Denote by Bi and B\u2032i the balls centered at node i for scales \u03b4 and \u03b4 \u2212 1. Let M\u03b4_{Bi} and M\u03b4\u22121_{B\u2032i} be their sets of local modes at scales \u03b4 and \u03b4 \u2212 1 respectively. Our idea is to use the M\u03b4\u22121_{B\u2032j}\u2019s to compute a candidate set containing M\u03b4_{Bi}.\n\nConsider two different cases, \u03b4 odd and \u03b4 even. When \u03b4 is odd, r = r\u2032 and Bi = B\u2032i. By definition, M\u03b4_{Bi} \u2286 M\u03b4\u22121_{B\u2032i}, so we can directly use the local modes of the previous scale as the candidate set for the current scale. When \u03b4 is even, r = r\u2032 + 1. The ball Bi is the union of the B\u2032j\u2019s for all j adjacent to i, Bi = \u222a_{j\u2208Ni} B\u2032j, where Ni is the set of neighbors of i. We collect the set of all consistent combinations of the M\u03b4\u22121_{B\u2032j}\u2019s for all j \u2208 Ni as the candidate set. This set is a superset of M\u03b4_{Bi}, because a local mode at scale \u03b4 has to be a local mode at scale \u03b4 \u2212 1.\n\nDropping unused local modes. In practice, we observe that a large number of local modes do not contribute to any global mode. These unused local modes can be dropped when computing global modes and when computing local modes of larger scales. To check whether a local mode of Bi can be dropped, we compare it with all local modes of an adjacent ball Bj, j \u2208 Ni. If it is not consistent with any local mode of Bj, we drop it. We go through all adjacent balls Bj in order to drop as many local modes as possible.\n\nFigure 3: Scalability.\n\n3.5 Complexity\nThere are three steps in our algorithm for each fixed \u03b4: computing candidates, verifying candidates, and computing the M best labelings of the junction tree. Denote by d the tree degree. Denote by \u03bb the maximum number of undropped local modes for any ball B and scale \u03b4. When \u03b4 = 1, we use the enumeration method. Since the ball radius is 1, the ball boundary size is O(d). There are at most L^d candidates for each ball. 
When \u03b4 > 1, we use local modes of scale \u03b4 \u2212 1 to construct the candidate set. Since each ball of scale \u03b4 is the union of O(d) balls of scale \u03b4 \u2212 1, there are at most \u03bb^d candidates per node. The verification takes O(DdL\u03b4^2(L + \u03b4)) time per candidate. (See [7] for a complexity analysis of the verification algorithm.) Therefore, overall, the computation and verification of all local modes for all D balls takes O(D^2 dL\u03b4^2(L + \u03b4)(L^d + \u03bb^d)). The last step runs Nilsson\u2019s algorithm on a junction tree with label size O(\u03bb), and thus takes O(D\u03bb^2 + MD\u03bb + MD log(MD)). Summing up these complexities gives the final complexity.\n\nScalability. Even though our algorithm is not polynomial in all relevant parameters, it is efficient in practice. The complexity is exponential in the tree degree d. However, in practice, we can enforce an upper bound on the tree degree in the model estimation stage; this way we can assume d to be constant. Another parameter in the complexity is \u03bb, the maximal number of undropped local modes of a geodesic ball. When the scale \u03b4 is large, \u03bb could be exponential in the graph size. However, in practice, we observe that \u03bb decreases quickly as \u03b4 increases. Therefore, our algorithm finishes in a reasonable time. See Sec. 4 for more discussion.\n4 Experiment\nTo validate our method, we first show the scalability and accuracy of our algorithm on synthetic data. Furthermore, we demonstrate using biological data how modes can be used as a novel analysis tool. Quantitative analysis of modes reveals new insights into the data. This finding is well supported by a visualization of the modes, which intuitively outlines the topographical map of the distribution. In all experiments, we choose M to be 500. At bigger scales, there are often fewer than M modes in total. 
As mentioned earlier, modes can also be applied to the problem of multiple predictions [7].

Scalability. We randomly generate tree-structured graphical models (tree size D = 200, . . . , 2000, label size L = 3) and test the speed. For each tree size, we generate 100 random models. In Figure 3(a), we show the running time of our algorithm for computing modes at all scales. The running time is roughly linear in the graph size. In Figure 3(b), we show the average running time for each δ when the graph size is 200, 1000 and 2000. As we can see, most of the computation time is spent on computations with δ = 1 and 2. Note that the enumeration method is used only when δ = 1. When δ ≥ 2, we reuse the local modes of the previous scale. The algorithm speed depends on the parameter λ, the maximum number of undropped local modes over all balls. In Figure 3(c), we show that λ drops quickly as the scale increases. We believe this is critical to the overall efficiency of our method. In Figure 3(d), we show the average number of global modes at different scales.

Accuracy. We randomly generate tree-structured distributions (D = 20, L = 2). We select the trees with strong modes as ground-truth trees, i.e., those with at least two modes up to δ = 7. See Figure 4(a) for the average number of modes at different scales over these selected tree models. Next, we draw samples from these trees and use the samples to estimate a tree model approximating each distribution. Finally, we compute the modes of the estimated tree and compare them to the modes of the ground-truth tree.

To evaluate the sensitivity of our method to noise, we randomly flip 0%, 5%, 10%, 15% and 20% of the labels of these samples. We compare the number of predicted modes to the number of true modes at each scale. The error is normalized by the number of true modes. See Figure 4(b). With small noise, our prediction is accurate except for δ = 1, when the number of true modes is very large.
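The noise model above (flipping a fixed fraction of labels) and the normalized mode-count error can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and the choice of replacing each flipped entry with a uniformly random different label are assumptions.

```python
import random

def flip_labels(samples, noise, num_labels, rng=None):
    """Corrupt samples by flipping a fraction `noise` of all label
    entries; each flipped entry is replaced with a different label
    chosen uniformly at random."""
    rng = rng or random.Random(0)
    corrupted = [list(s) for s in samples]   # copy; leave input intact
    n, D = len(corrupted), len(corrupted[0])
    positions = [(i, j) for i in range(n) for j in range(D)]
    for i, j in rng.sample(positions, int(noise * n * D)):
        old = corrupted[i][j]
        corrupted[i][j] = rng.choice([l for l in range(num_labels) if l != old])
    return corrupted

def mode_count_error(num_predicted, num_true):
    """Error of the predicted mode count, normalized by the true count."""
    return abs(num_predicted - num_true) / num_true
```

For instance, predicting two modes when the true distribution has become unimodal gives mode_count_error(2, 1) = 1, matching the error jump seen at large scales.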
As the noise level increases, the error increases linearly. We do notice an increase of error near δ = 7. This is because at δ = 8 many of the distributions become unimodal; predicting two modes then leads to a 50% error.

Figure 4: Accuracy. Panels (a)–(d). Denote by ε the noise level and by n the sample size.

We also measure the prediction accuracy using the Hausdorff distance between the predicted modes and the true modes. The Hausdorff distance between two finite point sets X and Y is defined as max(max_{x∈X} min_{y∈Y} ρ(x, y), max_{y∈Y} min_{x∈X} ρ(x, y)). The result is shown in Figure 4(c). We normalize the error by the tree size D, so the error is between zero and one. The error again increases linearly with the noise level. The increase at δ = 7 is due to the fact that many distributions change from multiple modes to a single mode. In Figure 4(d), we compare, for the same noise level, the error obtained with different sample sizes. With a sample size of 10K, the error is larger; with sample sizes of 40K and 80K, the errors are similar and small.

Biological data analysis. We compute modes of the microarray data of the Arabidopsis thaliana plant (108 samples, 39 dimensions) [24]. Each gene has three labels, "+", "0" and "-", which respectively denote over-expression, normal expression and under-expression of the gene. Based on the data sample, we estimate the tree graph and compute the top modes at different radii δ using the Hamming distance. We use multidimensional scaling to map these modes so that their pairwise Hamming distance is approximated by the L2 distance in R^2. The result is visualized in Fig. 5 at different scales. The size of each point is proportional to the log of its probability. Arrows in the figure show how each mode merges into a surviving mode at the larger scale.
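The Hausdorff distance used in the accuracy evaluation above follows directly from its definition; here is a minimal sketch under the Hamming metric (normalizing by the tree size D, as in the evaluation, is then a one-line division). The function names are illustrative.

```python
def hamming(x, y):
    """Hamming distance between two label vectors of equal length."""
    return sum(a != b for a, b in zip(x, y))

def hausdorff(X, Y, rho=hamming):
    """Hausdorff distance between finite point sets X and Y:
    max( max_x min_y rho(x, y), max_y min_x rho(x, y) )."""
    d_xy = max(min(rho(x, y) for y in Y) for x in X)
    d_yx = max(min(rho(x, y) for x in X) for y in Y)
    return max(d_xy, d_yx)
```

Note the asymmetry being symmetrized: if the predicted set misses a true mode, the second term is large even when every predicted mode is close to some true mode.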
The graph intuitively shows that there are two major modes when viewed at a large scale, and even shows how the modes evolve as we change the scale.

Figure 5: Microarray results. From left to right: scales 1 to 4.

5 Conclusion
This paper studies the (δ, ρ)-mode estimation problem for tree graphical models. The significance of this work lies in several aspects: (1) we develop an efficient algorithm to illustrate the intrinsic connection between structured statistical modeling and mode characterization; (2) our notion of (δ, ρ)-modes provides a new tool for visualizing the topographical information of complex discrete distributions. This work is the first step towards understanding the statistical and computational aspects of complex discrete distributions. For future investigations, we plan to relax the tree graphical model assumption to junction trees.

Acknowledgments
Chao Chen thanks Vladimir Kolmogorov and Christoph H. Lampert for helpful discussions. The research of Chao Chen and Dimitris N. Metaxas is partially supported by the grants NSF IIS 1451292 and NSF CNS 1229628. The research of Han Liu is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

References
[1] F. R. Bach and M. I. Jordan. Beyond independent components: trees and clusters. Journal of Machine Learning Research, 4:1205–1233, 2003.
[2] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In Computer Vision – ECCV 2012, pages 1–16, 2012.
[3] P. Chaudhuri and J. S. Marron. SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94(447):807–823, 1999.
[4] F. Chazal, L. J. Guibas, S. Y. Oudot, and P. Skraba.
Persistence-based clustering in Riemannian manifolds. In Proceedings of the 27th Annual ACM Symposium on Computational Geometry, pages 97–106. ACM, 2011.
[5] C. Chen and H. Edelsbrunner. Diffusion runs low on persistence fast. In IEEE International Conference on Computer Vision (ICCV), pages 423–430. IEEE, 2011.
[6] C. Chen, V. Kolmogorov, Y. Zhu, D. Metaxas, and C. H. Lampert. Computing the M most probable modes of a graphical model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
[7] C. Chen, H. Liu, D. N. Metaxas, M. G. Uzunbaş, and T. Zhao. High dimensional mode estimation – a graphical model approach. Technical report, October 2014.
[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[9] M. Fromer and C. Yanover. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins: Structure, Function, and Bioinformatics, 75(3):682–705, 2009.
[10] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, 2001.
[11] J. Li, S. Ray, and B. G. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8):1687–1723, 2007.
[12] T. Lindeberg. Scale-space theory in computer vision. Springer, 1993.
[13] H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation. Journal of Machine Learning Research, 12:907–951, 2011.
[14] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[15] R.
Maitra. Initializing partition-optimization algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(1):144–157, 2009.
[16] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.
[17] M. C. Minnotte and D. W. Scott. The mode tree: A tool for visualization of nonparametric density features. Journal of Computational and Graphical Statistics, 2(1):51–68, 1993.
[18] D. Nilsson. An efficient algorithm for finding the m most probable configurations in probabilistic expert systems. Statistics and Computing, 8(2):159–173, 1998.
[19] S. Nowozin and C. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):185–365, 2010.
[20] S. Ray and B. G. Lindsay. The topography of multivariate normal mixtures. Annals of Statistics, pages 2042–2065, 2005.
[21] B. W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society, Series B (Methodological), pages 97–99, 1981.
[22] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, pages 1453–1484, 2005.
[23] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[24] A. Wille, P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelic, P. von Rohr, L. Thiele, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):R92, 2004.
[25] A. Witkin. Scale-space filtering.
In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pages 329–332, 1987.