{"title": "Learning a Small Mixture of Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 1051, "page_last": 1059, "abstract": "The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, e.g. variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the $\\alpha$-divergence with $\\alpha = \\infty$. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.", "full_text": "Learning a Small Mixture of Trees\u2217\n\nM. Pawan Kumar\n\nComputer Science Department\n\nStanford University\n\nDaphne Koller\n\nComputer Science Department\n\nStanford University\n\npawan@cs.stanford.edu\n\nkoller@cs.stanford.edu\n\nAbstract\n\nThe problem of approximating a given probability distribution using a simpler dis-\ntribution plays an important role in several areas of machine learning, for example\nvariational inference and classi\ufb01cation. Within this context, we consider the task\nof learning a mixture of tree distributions. Although mixtures of trees can be\nlearned by minimizing the KL-divergence using an EM algorithm, its success de-\npends heavily on the initialization. 
We propose an ef\ufb01cient strategy for obtaining\na good initial set of trees that attempts to cover the entire observed distribution by\nminimizing the \u03b1-divergence with \u03b1 = \u221e. We formulate the problem using the\nfractional covering framework and present a convergent sequential algorithm that\nonly relies on solving a convex program at each iteration. Compared to previous\nmethods, our approach results in a signi\ufb01cantly smaller mixture of trees that pro-\nvides similar or better accuracies. We demonstrate the usefulness of our approach\nby learning pictorial structures for face recognition.\n\n1 Introduction\nProbabilistic models provide a powerful and intuitive framework for formulating several problems\nin machine learning and its application areas, such as computer vision and computational biology. A\ncritical choice to be made when using a probabilistic model is its complexity. For example, consider\na system that involves n random variables. A probabilistic model that de\ufb01nes a clique of size n\nhas the ability to model any distribution over these random variables. However, the task of learning\nand inference on such a model becomes computationally intractable. The other extreme case is to\nde\ufb01ne a tree structured model that allows for ef\ufb01cient learning [3] and inference [23]. However, tree\ndistributions have a restrictive form. Hence, they are not suitable for all applications.\n\nA natural way to alleviate the de\ufb01ciencies of tree distributions is to use a mixture of trees [21].\nMixtures of trees can be employed as accurate models for several interesting problems such as pose\nestimation [11] and recognition [5, 12]. In order to facilitate their use, we consider the problem\nof learning them by approximating an observed distribution. 
Note that the mixture can be learned by minimizing the Kullback-Leibler (KL) divergence with respect to the observed distribution using an expectation-maximization (EM) algorithm [21]. However, there are two main drawbacks of this approach: (i) minimization of KL divergence mostly tries to explain the dominant mode of the observed distribution [22], that is it does not explain the entire distribution; and (ii) as the EM algorithm is prone to local minima, its success depends heavily on the initialization. An intuitive solution to both these problems is to obtain an initial set of trees that covers as much of the observed distribution as possible. To this end, we pose the learning problem as that of obtaining a set of trees that minimize a suitable α-divergence [25].\nThe α-divergence measures are a family of functions over two probability distributions that measure the information gain contained in them: that is, given the first distribution, how much information is obtained by observing the second distribution. They form a complete family of measures, in that no other function satisfies all the postulates of information gain [25]. When used as an objective function to approximate an observed distribution, the value of α plays a significant role. For example, when α = 1 we obtain the KL divergence. As the value of α keeps increasing, the divergence measure becomes more and more inclusive [8], that is it tries to cover as much of the observed distribution as possible [22]. Hence, a natural choice for our task of obtaining a good initial estimate would be to set α = ∞.\n\n∗This work was supported by DARPA SA4996-10929-4 and the Boeing company.\n\nWe formulate the minimization of α-divergence with α = ∞ within the fractional covering framework [24]. 
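The mode-seeking versus covering contrast described above can be made concrete with a toy numerical example (the three-state distributions below are illustrative, not from the paper): a model that underweights a rare state by a factor of 10 pays almost no KL penalty, but the full log 10 under the worst-case measure.

```python
import math

def kl(p, q):
    # D_1: KL divergence, sum_x p(x) log(p(x)/q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_inf(p, q):
    # D_infinity: worst-case log ratio, max_x log(p(x)/q(x))
    return max(math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.69, 0.30, 0.01]    # observed distribution with a rare state
q = [0.699, 0.30, 0.001]  # model that underweights the rare state 10x

print(kl(p, q))     # small: the rare state only contributes p(x) * log 10
print(d_inf(p, q))  # log 10: the worst-case ratio ignores how rare the state is
```

Because the KL contribution of each state is weighted by its observed probability, ignoring a rare mode is cheap under α = 1 but maximally expensive under α = ∞, which is exactly why the latter yields covering initializations.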
However, the standard iterative algorithm for solving fractional covering is not readily\napplicable to our problem due to its small stepsize. In order to overcome this de\ufb01ciency we adapt\nthis approach speci\ufb01cally for the task of learning mixtures of trees. Each iteration of our approach\nadds one tree to the mixture and only requires solving a convex optimization problem. In practice,\nour strategy converges within a small number of iterations thereby resulting in a small mixture of\ntrees. We demonstrate the effectiveness of our approach by providing a comparison with state of the\nart methods and learning pictorial structures [6] for face recognition.\n2 Related Work\nThe mixture of trees model was introduced by Meila and Jordan [21] who highlighted its appeal\nby providing simple inference and sampling algorithms. They also described an EM algorithm that\nlearned a mixture of trees by minimizing the KL divergence. However, the accuracy of the EM\nalgorithm is highly dependent on the initial estimate of the mixture. This is evident in the fact\nthat their experiments required a large mixture of trees to explain the observed distribution, due to\nrandom initialization.\n\nSeveral works have attempted to obtain a good set of trees by devising algorithms for minimizing\nthe KL divergence [8, 13, 19, 26]. In contrast, our method uses \u03b1 = \u221e, thereby providing a set of\ntrees that covers the entire observed distribution. It has been shown that mixture of trees admit a\ndecomposable prior [20]. In other words, one can concisely specify a certain prior probability for\neach of the exponential number of tree structures for a given set of random variables. Kirschner and\nSmyth [14] have also proposed a method to handle a countably in\ufb01nite mixture of trees. However,\nthe complexity of both learning and inference in these models restricts their practical use.\n\nResearchers have also considered mixtures of trees in the log-probability space. 
Unlike a mixture in the probability space considered in this paper (which contains a hidden variable), mixtures of trees in log-probability space still define pairwise Markov networks. Such mixtures of trees have been used to obtain upper bounds on the log partition function [27]. However, in this case, the mixture is obtained by considering subgraphs of a given graphical model instead of minimizing a divergence measure with respect to the observed data. Finally, we note that semi-metric distance functions can be approximated by a mixture of tree metrics using the fractional packing framework [24]. This allows us to approximate semi-metric probabilistic models by a simpler mixture of (not necessarily tree) models whose pairwise potentials are defined by tree metrics [15, 17].\n3 Preliminaries\nTree Distribution. Consider a set of n random variables V = {v1, · · · , vn}, where each variable va can take a value xa ∈ Xa. We represent a labeling of the random variables (i.e. a particular assignment of values) as a vector x = {xa|a = 1, · · · , n}. A tree structured model defined over the random variables V is a graph whose nodes correspond to the random variables and whose edges E define a tree. Such a model assigns a probability to each labeling that can be written as\n\nPr(x|θT) = (1/Z(θT)) prod_{(va,vb)∈E} θT_ab(xa, xb) / prod_{va∈V} θT_a(xa)^(deg(a)−1). (1)\n\nHere θT_a(·) refers to unary potentials whose values depend on one variable at a time, and θT_ab(·, ·) refers to pairwise potentials whose values depend on two neighboring variables at a time. The vector θT is the parameter of the model (which consists of all the potentials) and Z(θT) is the partition function which ensures that the probability sums to one. The term deg(a) denotes the degree of the variable va.\nMixture of Trees. 
As the name suggests, a mixture of trees is defined by a set of trees along with a probability distribution over them, that is θM = {(θT, ρT)} such that the mixture coefficients ρT > 0 for all T and sum_T ρT = 1. It defines the probability of a given labeling as\n\nPr(x|θM) = sum_T ρT Pr(x|θT). (2)\n\nα-Divergence. The α-divergence between distributions Pr(·|θ1) (say the observed distribution) and Pr(·|θ2) (the simpler distribution) is given by\n\nDα(θ1||θ2) = (1/(α − 1)) log ( sum_x Pr(x|θ1)^α / Pr(x|θ2)^(α−1) ). (3)\n\nThe α-divergence measure is strictly non-negative and is equal to 0 if and only if θ1 is a reparameterization of θ2. It is a generalization of KL divergence which corresponds to α = 1, that is\n\nD1(θ1||θ2) = sum_x Pr(x|θ1) log ( Pr(x|θ1) / Pr(x|θ2) ). (4)\n\nAs mentioned earlier, we are interested in the case where α = ∞, that is\n\nD∞(θ1||θ2) = max_x log ( Pr(x|θ1) / Pr(x|θ2) ). (5)\n\nThe inclusive property of α = ∞ is evident from the above formula. Since we would like to minimize the maximum ratio of probabilities (i.e. the worst case), we need to ensure that no value of Pr(x|θ2) is very small, that is the entire distribution is covered. In contrast, the KL divergence can admit very small values of Pr(x|θ2) since it is concerned with the summation shown in equation (4) (and not the worst case). To avoid confusion, we shall refer to the case where α = 1 as KL divergence and the α = ∞ case as α-divergence throughout this paper.\nThe Learning Problem. 
Given a set of samples {xi, i = 1, · · · , m} along with their probabilities P̂(xi), our task is to learn a mixture of trees θM* such that\n\nθM* = argmin_{θM} ( max_i log ( P̂(xi) / Pr(xi|θM) ) ) = argmax_{θM} ( min_i Pr(xi|θM) / P̂(xi) ). (6)\n\nWe will concentrate on the second form in the above equation (where the logarithm has been dropped). We define T = {θTj} to be the set of all t tree distributions that are defined over n variables. It follows that the probability of a labeling for any mixture of trees can be written as\n\nPr(x|θM) = sum_j ρj Pr(x|θTj), (7)\n\nfor suitable values of ρj. Note that the mixing coefficients ρ should define a valid probability distribution. In other words, ρ belongs to the polytope P defined as\n\nρ ∈ P ⇒ sum_j ρj = 1, ρj ≥ 0, ∀j = 1, · · · , t. (8)\n\nOur task is to find a sparse vector ρ that minimizes the α-divergence with respect to the observed distribution. In order to formally specify the minimization of α-divergence as an optimization problem, we define an m × t matrix A and an m × 1 vector b such that\n\nA(i, j) = Pr(xi|θTj) and bi = P̂(xi). (9)\n\nWe denote the ith row of A as ai and the ith element of b as bi. Using the above notation, the learning problem can be specified as\n\nmax_ρ λρ, s.t. ai ρ ≥ λρ bi, ∀i, ρ ∈ P, (10)\n\nwhere λρ = min_i ai ρ / bi due to the form of the above LP. The above formulation suggests that a natural way to attack the problem would be to use the fractional covering framework [24]. 
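The quantities in (9)-(10) are easy to sketch in code (hypothetical numbers; the function name is illustrative): given the sample-by-tree matrix A and observed probabilities b, the quality of a mixing vector ρ is λρ = min_i ai ρ / bi, and the α-divergence achieved by the mixture is −log λρ.

```python
import math

def mixture_quality(A, b, rho):
    """lambda_rho = min_i (a_i . rho) / b_i from the LP (10);
    the alpha-divergence of the mixture is then -log(lambda_rho)."""
    lam = min(
        sum(a_ij * r_j for a_ij, r_j in zip(row, rho)) / b_i
        for row, b_i in zip(A, b)
    )
    return lam, -math.log(lam)

# Two samples, two trees: A[i][j] = Pr(x_i | tree j), b[i] = observed prob.
A = [[0.5, 0.1],
     [0.5, 0.9]]
b = [0.4, 0.6]
lam, div = mixture_quality(A, b, [0.5, 0.5])
print(lam, div)  # uniform mixture puts 0.3 on x_1 (ratio 0.75) and 0.7 on x_2
```

Maximizing λρ over ρ ∈ P is exactly the covering LP (10); a λρ of 1 would mean the mixture matches the observed probabilities on every sample.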
We begin by briefly describing fractional covering in the next section.\n4 Fractional Covering\nGiven an m × t matrix A and an m × 1 vector b > 0, the fractional covering problem is to determine whether there exists a vector ρ ∈ P such that Aρ ≥ b. The only restriction on the polytope P is that Aρ ≥ 0 for all ρ ∈ P, which is clearly satisfied by our learning problem (since ai ρ is the probability of xi specified by the mixture of trees corresponding to ρ). Let\n\nλ* = max_ρ min_i ai ρ / bi. (11)\n\nIf λ* < 1 then clearly there does not exist a ρ such that Aρ ≥ b. However, if λ* ≥ 1, then the fractional covering problem requires us to find an ε-optimal solution, that is find a ρ such that\n\nAρ ≥ (1 − ε) λ* b, (12)\n\nwhere ε > 0 is a user-specified tolerance factor. Using the definitions of A, b and ρ from the previous section, we observe that in our case λ* = 1. In other words, there exists a solution such that Aρ = b. This can easily be seen by considering a tree with parameter θTj such that\n\nPr(xi|θTj) = 1 if i = j, and 0 otherwise, (13)\n\nand setting ρj = P̂(xj). The above solution provides an α-divergence of 0 but at the cost of introducing m trees in the mixture (where m is the number of samples provided). We would like to find an ε-optimal solution with a smaller number of trees by solving the LP (10). However, we cannot employ standard interior point algorithms for optimizing problem (10). This is due to the fact that each of its m constraints is defined over an infinite number of unknowns (specifically, the mixture coefficients for each of the infinite number of tree distributions defined over the n random variables). 
Fortunately, Plotkin et al. [24] provide an iterative algorithm for solving problem (10) that can handle an arbitrarily large number of unknowns in every constraint.\nThe Fractional Covering Algorithm. In order to obtain a solution to problem (10), we solve the following related problem:\n\nmin_{ρ∈P} Φ(y) ≡ y^T b, s.t. yi = (1/bi) exp(−β ai ρ / bi). (14)\n\nThe objective function Φ(y) is called the potential function for fractional covering. Plotkin et al. [24] showed that minimizing Φ(y) solves the original fractional covering problem. The term β is a parameter that is inversely proportional to the stepsize σ of the algorithm. The fractional covering algorithm is an iterative strategy. At iteration t, the variable ρt is updated as ρt ← (1 − σ)ρt−1 + σρ' such that the update attempts to decrease the potential function. Specifically, the algorithm proposed in [24] suggests using the first order approximation of Φ(y), that is\n\nρ' = argmin_ρ ( sum_i y'i (bi − βσ ai ρ) ) = argmax_ρ y'^T Aρ, (15)\n\nwhere\n\ny'i = (1/bi) exp(−β (1 − σ) ai ρ / bi). (16)\n\nTypically, the above problem is easy to solve (including for our case, as will be seen in the next section). Furthermore, for a sufficiently large value of β (∝ log m) the above update rule decreases Φ(y). 
In more detail, the algorithm of [24] is as follows:\n\n• Define w = max_ρ max_i ai ρ / bi to be the width of the problem.\n• Start with an initial solution ρ0.\n• Define λρ0 = min_i ai ρ0 / bi, and σ = ε/(4βw).\n• While λρ < 2λρ0, at iteration t:\n  – Define y' as shown in equation (16).\n  – Find ρ' = argmax_{ρ∈P} y'^T Aρ.\n  – Update ρt ← (1 − σ)ρt−1 + σρ'.\n\nPlotkin et al. [24] suggest starting with a tolerance factor of ε0 = 1/6 and dividing the value of ε0 by 2 after every call to the above procedure terminates. This process is continued until a sufficiently accurate (i.e. an ε-optimal) solution is recovered. Note that during each call to the above procedure the potential function Φ(y) is both upper and lower bounded, specifically\n\nexp(−2βλρ0) ≤ Φ(y) ≤ m exp(−βλρ0). (17)\n\nFurthermore, we are guaranteed to decrease the value of Φ(y) at each iteration. Hence, it follows that the above algorithm will converge. We refer the reader to [24] for more details.\n5 Modifying Fractional Covering\nThe above algorithm provides an elegant way to solve the general fractional covering problem. However, as will be seen shortly, in our case it leads to undesirable solutions. Nevertheless, we show that appropriate modifications can be made to obtain a small and accurate mixture of trees. We begin by identifying the deficiencies of the fractional covering algorithm for our learning problem.\n5.1 Drawbacks of the Algorithm\nThere are two main drawbacks of fractional covering. First, the value of β is typically very large, which results in a small stepsize σ. In our experiments, β was of the order of 10^3, which resulted in slow convergence of the algorithm. 
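The update loop of [24] described above can be sketched on a toy instance (all names and numbers below are illustrative): when P is the probability simplex, the linear subproblem max_{ρ∈P} y'^T Aρ is solved by a vertex, i.e. by putting all mass on the best single column.

```python
import math

def frac_cover_step(A, b, rho, beta, sigma):
    """One iteration of the fractional covering update on a toy instance
    where P is the simplex: price each constraint via equation (16), take
    the best vertex of P, and move a step of size sigma towards it."""
    m, t = len(A), len(rho)
    cover = [sum(A[i][j] * rho[j] for j in range(t)) for i in range(m)]
    y = [math.exp(-beta * (1 - sigma) * cover[i] / b[i]) / b[i]
         for i in range(m)]                                   # equation (16)
    # linear subproblem: max over vertices of the simplex = best column
    scores = [sum(y[i] * A[i][j] for i in range(m)) for j in range(t)]
    j_star = max(range(t), key=scores.__getitem__)
    # rho_t <- (1 - sigma) rho_{t-1} + sigma rho', with rho' = e_{j*}
    return [(1 - sigma) * r + (sigma if j == j_star else 0.0)
            for j, r in enumerate(rho)]
```

For example, with A = [[1,0],[0,1]], b = [0.5,0.5] and a lopsided start ρ = [0.9,0.1], one step with β = 2 and σ = 0.5 shifts mass to the under-covered constraint, raising λρ = min_i ai ρ / bi from 0.2 to 0.9.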
Second, the update step provides singleton trees, that is trees with a probability of 1 for one labeling and 0 for all others. This is due to the fact that, in our case, the update step solves the following problem:\n\nmax_{ρ∈P} sum_j sum_i y'i ρj Pr(xi|θTj). (18)\n\nNote that the above problem is an LP in ρ. Hence, there must exist an optimal solution on a vertex of the polytope P. In other words, we obtain a single tree distribution θT* such that\n\nθT* = argmax_{θT} sum_i y'i Pr(xi|θT). (19)\n\nThe optimal tree distribution for the above problem concentrates the entire mass on the sample xi' where i' = argmax_i y'i. Such singleton trees are not desirable as they also result in slow convergence of the algorithm. Furthermore, the learned mixture only provides a non-zero probability for the samples used during training. Hence, the mixture cannot be used for previously unseen samples, thereby rendering it practically useless. Note that the method of Rosset and Segal [26] also faces a similar problem during their update steps for minimizing the KL divergence. In order to overcome this difficulty, they suggest approximating problem (18) by\n\nθT* = argmax_{θT} sum_i y'i log ( Pr(xi|θT) ), (20)\n\nwhich can be solved efficiently using the Chow-Liu algorithm [3]. However, our preliminary experiments (accuracies not reported) indicate that this approach does not work well for minimizing the potential function Φ(y).\n5.2 Fixing the Drawbacks\nWe adapt the original fractional covering algorithm for our problem in order to overcome the drawbacks mentioned above. The first drawback is handled easily. We start with a small value of β and increase it by a factor of 2 if we are not able to reduce the potential function Φ(y) at a given iteration. 
Since we are assured that the value of Φ(y) decreases for a finite value of β, this procedure is guaranteed to terminate. In our experiments, we initialized β = 1/w and its value never exceeded 32/w. Note that choosing β to be inversely proportional to w ensures that the initial values of y'i in equation (16) are sufficiently large (at least exp(−(1 − σ))).\nIn order to address the second drawback, we note that our aim at an iteration t of the algorithm is to reduce the potential function Φ(y). That is, given the current distribution parameterized by θMt we would like to add a new tree θTt to the mixture that solves the following problem:\n\nθTt = argmin_{θT} ( Φ(y) ≡ sum_i y'i exp(−βσ Pr(xi|θT) / P̂(xi)) ) (21)\n\ns.t. sum_i Pr(xi|θT) ≤ 1, Pr(xi|θT) ≥ 0, ∀i = 1, · · · , m, (22)\n\nθT ∈ T. (23)\n\nHere, T is the set of all tree distributions defined over n random variables. Note that the algorithm of [24] optimizes the first order approximation of the objective function (21). However, as seen previously, for our problem this results in an undesirable solution. Instead, we directly optimize Φ(y) using an alternative two step strategy. In the first step, we drop the last constraint from the above problem. In other words, we obtain the values of Pr(xi|θT) that form a valid (but not necessarily tree-structured) distribution and minimize the function Φ(y). Note that since Φ(y) is not linear in Pr(xi|θT), the optimal solution provides a dense distribution Pr(·|θT) (as opposed to the first order linear approximation, which provides a singleton distribution). In the second step, we project these values to a tree distribution. 
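The projection step relies on the Chow-Liu algorithm [3]; a minimal sketch of its structure-selection part (the function names and the weighted-sample input format are illustrative, not the paper's implementation) is to compute pairwise mutual information and take a maximum spanning tree.

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_edges(samples, weights):
    """Project a weighted distribution over samples onto a tree structure:
    maximum spanning tree under pairwise mutual information (Chow-Liu)."""
    n = len(samples[0])
    total = sum(weights)

    def marginal(idx):
        c = Counter()
        for s, w in zip(samples, weights):
            c[tuple(s[i] for i in idx)] += w / total
        return c

    singles = {a: marginal((a,)) for a in range(n)}

    def mutual_info(a, b):
        pab = marginal((a, b))
        return sum(p * math.log(p / (singles[a][(xa,)] * singles[b][(xb,)]))
                   for (xa, xb), p in pab.items() if p > 0)

    # Kruskal: add edges in decreasing MI order, skipping cycles (union-find).
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = []
    for a, b in sorted(combinations(range(n), 2),
                       key=lambda e: -mutual_info(*e)):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            edges.append((a, b))
    return edges
```

For instance, on three binary variables where the first two are perfectly correlated and the third is independent, the edge between the correlated pair (mutual information log 2) is always selected.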
It is easy to see that dropping constraint (23) results in a convex relaxation of the original problem. We solve the convex relaxation using a log-barrier method [1]. Briefly, this implies solving a series of unconstrained optimization problems until we are within a user-specified tolerance value of τ from the optimal solution. Specifically,\n\n• Set f = 1.\n• Solve min_{Pr(·|θT)} ( f Φ(y) − sum_i log(Pr(xi|θT)) − log(1 − sum_i Pr(xi|θT)) ).\n• If m/f ≤ τ, then stop. Otherwise, update f = μf and repeat the previous step.\n\nWe used μ = 1.5 in all our experiments, which was sufficient to obtain accurate solutions for the convex relaxation. At each iteration, the unconstrained optimization problem is solved using Newton's method. Recall that Newton's method minimizes a function g(z) by updating the current solution as\n\nz ← z − (∇^2 g(z))^(−1) ∇g(z), (24)\n\nwhere ∇^2 g(·) denotes the Hessian matrix and ∇g(·) denotes the gradient vector. Note that the most expensive step in the above approach is the inversion of the Hessian matrix. However, it is easy to verify that in our case all the off-diagonal elements of the Hessian are equal to each other. By taking advantage of this special form of the Hessian, we compute its inverse in O(m^2) time using Gaussian elimination (i.e. linear in the number of elements of the Hessian).\n\nOnce the values of Pr(xi|θT) are computed in this manner, they are projected to a tree distribution using the Chow-Liu algorithm [3]. Note that after the projection step we are no longer guaranteed to decrease the function Φ(y). This would imply that the overall algorithm would not be guaranteed to converge. 
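The constant-off-diagonal Hessian structure noted above can even be exploited in O(m): a matrix with equal off-diagonal entries c is a diagonal matrix plus a rank-one term, so the Newton direction follows from the Sherman-Morrison identity. This is a sketch under that structural assumption (the paper itself uses Gaussian elimination in O(m^2); the function name is illustrative).

```python
def newton_direction(diag, c, grad):
    """Solve H d = grad where H[i][i] = diag[i] and H[i][j] = c for i != j.
    Write H = D + c 1 1^T with D = diag(diag[i] - c); Sherman-Morrison gives
    H^-1 g = D^-1 g - c (sum of D^-1 g) D^-1 1 / (1 + c sum of D^-1 1).
    Assumes diag[i] > c, which holds for a positive definite H of this form."""
    d_inv_g = [g / (d - c) for g, d in zip(grad, diag)]
    d_inv_1 = [1.0 / (d - c) for d in diag]
    factor = c * sum(d_inv_g) / (1.0 + c * sum(d_inv_1))
    return [x - factor * y for x, y in zip(d_inv_g, d_inv_1)]
```

The result can be checked by multiplying back: with diag = [4, 5, 6], c = 1 and grad = [1, 2, 3], the returned d satisfies H d = grad to machine precision.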
In order to overcome this problem, if we are unable to decrease Φ(y) then we determine the sample xi' such that\n\ni' = argmax_i Pr(xi|θMt) / P̂(xi), (25)\n\nthat is the sample best explained by the current mixture. We enforce Pr(xi'|θT) = 0 and solve the above convex relaxation again. Note that the solution to the new convex relaxation (i.e. the one with the newly introduced constraint for sample xi') can easily be obtained from the solution of the previous convex relaxation using the following update:\n\nPr(xi|θT) ← Pr(xi|θT) + P̂(xi) Pr(xi'|θT)/s if i ≠ i', and 0 otherwise, (26)\n\nwhere s = sum_i P̂(xi). In other words, we do not need to use the log-barrier method to solve the new convex relaxation. We then project the updated values of Pr(xi|θT) to a tree distribution. This process of eliminating one sample and projecting to a tree is repeated until we are able to reduce the value of Φ(y). Note that in the worst case we will eliminate all but one sample (specifically, the one that corresponds to the update scheme of [24]). In other words, we will add a singleton tree. However, in practice our algorithm converges in a small number (≪ m) of iterations and provides an accurate mixture of trees. In fact, in all our experiments we never obtained any singleton trees. We conclude the description of our method by noting that once the new tree distribution θTt is obtained, the value of σ is easily updated as σ = argmin_σ Φ(y).\n6 Experiments\nWe present a comparison of our method with the state of the art algorithms. We also use it to learn pictorial structures for face recognition. 
Note that our method is efficient in practice due to the special form of the Hessian matrix (for the log-barrier method) and the Chow-Liu algorithm [3, 21] (for the projection to tree distributions). In all our experiments, each iteration takes only 5 to 10 minutes (and the number of iterations is equal to the number of trees in the mixture).\n\nDataset  | TANB       | MF            | Tree         | MT           | [26] + MT    | Our + MT\nAgaricus | 100.0 ± 0  | 99.45 ± 0.004 | 98.65 ± 0.32 | 99.98 ± 0.04 | 100.0 ± 0    | 100.0 ± 0\nNursery  | 93.0 ± 0   | 98.0 ± 0.01   | 92.17 ± 0.38 | 99.2 ± 0.02  | 98.35 ± 0.30 | 99.28 ± 0.13\nSplice   | 94.9 ± 0.9 | -             | 95.7 ± 0.2   | 95.5 ± 0.3   | 95.6 ± 0.42  | 96.1 ± 0.15\n\nTable 1: Classification accuracies for the datasets used in [21]. The first column shows the name of the dataset. The subsequent columns show the mean accuracies and the standard deviation over 5 trials of tree-augmented naive Bayes [10], mixture of factorial distributions [2], single tree classifier [3], mixture of trees with random initialization (i.e. the numbers reported in [21]), initialization with [26] and initialization with our approach. Note that our method provides similar accuracies to [21] while using a smaller mixture of trees (see text).\n\nComparison with Previous Work. As mentioned earlier, our approach can be used to obtain a good initialization for the EM algorithm of [21] since it minimizes α-divergence (providing complementary information to the KL-divergence used in [21]). This is in contrast to the random initializations used in the experiments of [21] or the initialization obtained by [26] (that also attempts to minimize the KL-divergence). 
We consider the task of using the mixture of trees as a classifier, that is given training data that consists of feature vectors xi together with the class values ci, the task is to correctly classify previously unseen test feature vectors. Following the protocol of [21], this can be achieved in two ways. For the first type of classifier, we append the feature vector xi with its class value ci to obtain a new feature vector x'i. We then learn a mixture of trees that predicts the probability of x'i. Given a new feature vector x we assign it the class c that results in the highest probability. For the second type of classifier, we learn a mixture of trees for each class value such that it predicts the probability of a feature vector belonging to that particular class. Once again, given a new feature vector x we assign it the class c which results in the highest probability.\nWe tested our approach on the three discrete valued datasets used in [21]. In all our experiments, we initialized the mixture with a single tree obtained from the Chow-Liu algorithm. We closely followed the experimental setup of [21] to ensure that the comparisons are fair. Table 1 provides the accuracy of our approach together with the results reported in [21]. For 'Splice' the first classifier provides the best results, while 'Agaricus' and 'Nursery' use the second classifier. Note that our method provides similar accuracies to [21]. More importantly, it uses a smaller mixture of trees to achieve these results. Specifically, the method of [21] uses 12, 30 and 3 trees for the three datasets respectively. In contrast our method uses 3-5 trees for 'Agaricus', 10-15 trees for 'Nursery' and 2 trees for 'Splice' (where the number of trees in the mixture was obtained using a validation dataset, see [21] for details). 
Furthermore, unlike [21, 26], we obtain better accuracies by using a mixture\nof trees instead of a single tree for the \u2018Splice\u2019 dataset. It is worth noting that [26] also provided a\nsmall set of initial trees (with comparable size to our method). However, since the trees do not cover\nthe entire observed distribution, their method provides less accurate results.\nFace Recognition. We tested our approach on the task of recognizing faces using the publicly\navailable dataset1 containing the faces of 11 characters in an episode of \u2018Buffy the Vampire Slayer\u2019.\nThe total number of faces in the dataset is 24,244. For each face we are provided with the location\nof 13 facial features (see Fig. 1). Furthermore, for each facial feature, we are also provided with\na vector that represents the appearance of that facial feature [5] (using the normalized grayscale\nvalues present in a circular region of radius 7 centered at the facial feature). As noted in previous\nwork [5, 18] the task is challenging due to large intra-class variations in expression and lighting\nconditions.\n\nGiven the appearance vector, the likelihood of each facial feature belonging to a particular character\ncan be found using logistic regression. However, the relative locations of the facial features also\noffer important cues in distinguishing one character from the other (e.g. the width of the eyes or the\ndistance between an eye and the nose). Typically, in vision systems, this information is not used.\nIn other words, the so-called bag of visual words model is employed. This is due to the somewhat\ncounter-intuitive observation made by several researchers that models that employ spatial prior on\nthe features, e.g. pictorial structures [6], often provide worse recognition accuracies than those that\nthrow away this information. 
However, this may be due to the fact that often the structure and parameters of pictorial structures and other related models are set by hand.\n\n1 Available at http://www.robots.ox.ac.uk/~vgg/research/nface/data.html\n\nFigure 1: The structure of the seven trees learned for 3 of the 11 characters using our method. The red squares show the position of the facial features while the blue lines indicate the edges. The structure and parameters of the trees vary significantly, thereby indicating the multimodality of the observed distribution.\n\nSize | 0      | 1      | 2      | 3      | 4      | 5      | 6      | 7\n[26] | 65.68% | 66.05% | 66.01% | 66.01% | 66.08% | 66.08% | 66.16% | 66.20%\nOur  | 65.68% | 66.05% | 66.65% | 66.86% | 67.25% | 67.48% | 67.50% | 67.68%\n\nTable 2: Accuracy for the face recognition experiments. The columns indicate the size of the mixture, ranging from 0 (i.e. the bag of visual words model) to 7 (where the results saturate). Note that our approach, which minimizes the α-divergence, provides better results than the method of [26], which minimizes KL-divergence.\n\nIn order to test whether a spatial model can help improve recognition, we learned a mixture of trees for each of the characters. The random variables of the trees correspond to the facial features and their values correspond to the relative location of the facial feature with respect to the center of the nose. The unary potentials of each random variable are specified using the appearance vectors (i.e. the likelihood obtained by logistic regression). In order to obtain the pairwise potentials (i.e. the structure and parameters of the mixture of trees), the faces are normalized to remove global scaling and in-plane rotation using the location of the facial features. We use the faces found in the first 80% of the episode to learn the mixture of trees. The faces found in the remaining 20% of the episode were used as test data. Splitting the dataset in this manner (i.e. 
a non-random split) ensures that we
do not have any trivial cases where a face found in frame t is used for training and a (very similar)
face found in frame t + 1 is used for testing.

Fig. 1 shows the structure of the trees learned for 3 characters. The structures differ significantly
between characters, which indicates that different spatial priors are dominant for different characters.
Although the structures of the trees for a particular character are similar, their parameters vary
considerably. This suggests that the distribution is in fact multimodal and therefore cannot be
represented accurately using a single tree. Although vision researchers have tried to overcome this
problem by using more complex models, e.g. see [4], their use is limited by a lack of efficient
learning algorithms. Table 2 shows the accuracy of the mixture of trees learned by the method
of [26] and by our approach. In this experiment, refining the mixture of trees using the EM algorithm
of [21] did not improve the results. This is because the training and testing data differ
significantly (due to the non-random split, unlike the previous experiments, which used random splits of
the UCI datasets). In fact, when we split the face dataset randomly, we found that the EM algorithm
did help. However, classification problems simulated using random splits of video frames are rare
in real-world applications. Since [26] minimizes the KL-divergence, it mostly explains
the dominant mode of the observed distribution. This is evident from the fact that the accuracy of its
mixture of trees does not increase significantly as the size of the mixture increases (see Table 2, first
row). 
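The contrast between the two objectives can be made concrete on a toy discrete distribution. The following sketch assumes the order-∞ Rényi form of the α-divergence, D_∞(P||Q) = log max_x P(x)/Q(x) [22, 25]; the distributions are illustrative and not taken from the paper:

```python
import math

def kl(p, q):
    """KL-divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_inf(p, q):
    """alpha-divergence at alpha = infinity (Renyi order infinity):
    log of the worst-case ratio p(x) / q(x)."""
    return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0))

# Toy target with two strong modes and a weak tail state (illustration only).
p = [0.49, 0.49, 0.02]
# q_one_mode explains only the dominant mode; q_cover spreads over both modes.
q_one_mode = [0.90, 0.05, 0.05]
q_cover = [0.48, 0.48, 0.04]

# KL penalizes the uncovered mode only mildly, whereas D_inf blows up
# (log(0.49 / 0.05) = log 9.8, approx. 2.28) as soon as any mode of p is
# left uncovered, so minimizing it forces the mixture to cover all modes.
print(kl(p, q_one_mode), kl(p, q_cover))
print(d_inf(p, q_one_mode), d_inf(p, q_cover))
```

Under this objective, a mixture that leaves any region of the observed distribution poorly explained incurs a large penalty, which matches the covering behaviour described above.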
In contrast, minimizing the α-divergence provides a diverse set of trees that attempt to
explain the entire distribution, thereby providing significantly better results (Table 2, second row).

7 Discussion

We formulated the problem of obtaining a small mixture of trees as minimizing the α-divergence
within the fractional covering framework. Our experiments indicate that the suitably modified
fractional covering algorithm provides accurate models. We believe that our approach offers a natural
framework for addressing the problem of minimizing the α-divergence and could prove useful for other
classes of mixture models, for example mixtures of trees in log-probability space, for which there
exist several efficient and accurate inference algorithms [16, 27]. There also appears to be a
connection between fractional covering (proposed in the theory community) and Discrete AdaBoost [7, 9]
(proposed in the machine learning community) that merits further exploration.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In KDD, pages 153-180, 1995.
[3] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.
[4] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for parts-based recognition using statistical models. In CVPR, 2005.
[5] M. Everingham, J. Sivic, and A. Zisserman. Hello! My name is... Buffy - Automatic naming of characters in TV video. In BMVC, 2006.
[6] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. TC, 22:67-92, January 1973.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. 
Journal of Computer and System Sciences, 55(1):119-139, 1997.
[8] B. Frey, R. Patrascu, T. Jaakkola, and J. Moran. Sequentially fitting inclusive trees for inference in noisy-OR networks. In NIPS, 2000.
[9] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.
[10] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.
[11] S. Ioffe and D. Forsyth. Human tracking with mixtures of trees. In ICCV, pages 690-695, 2001.
[12] S. Ioffe and D. Forsyth. Mixtures of trees for object recognition. In CVPR, pages 180-185, 2001.
[13] Y. Jing, V. Pavlovic, and J. Rehg. Boosted Bayesian network classifiers. Machine Learning, 73(2):155-184, 2008.
[14] S. Kirschner and P. Smyth. Infinite mixture of trees. In ICML, pages 417-423, 2007.
[15] J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In STOC, 1999.
[16] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. PAMI, 2006.
[17] M. P. Kumar and D. Koller. MAP estimation of semi-metric MRFs via hierarchical graph cuts. In UAI, 2009.
[18] M. P. Kumar, P. Torr, and A. Zisserman. An invariant large margin nearest neighbour classifier. In ICCV, 2007.
[19] Y. Lin, S. Zhu, D. Lee, and B. Taskar. Learning sparse Markov network structure via ensemble-of-trees models. In AISTATS, 2009.
[20] M. Meila and T. Jaakkola. Tractable Bayesian learning of tree belief networks. In UAI, 2000.
[21] M. Meila and M. Jordan. Learning with a mixture of trees. JMLR, 1:1-48, 2000.
[22] T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[23] J. Pearl. 
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[24] S. Plotkin, D. Shmoys, and E. Tardos. Fast approximation algorithms for fractional packing and covering problems. Mathematics of Operations Research, 20:257-301, 1995.
[25] A. Rényi. On measures of information and entropy. In Berkeley Symposium on Mathematical Statistics and Probability, pages 547-561, 1961.
[26] S. Rosset and E. Segal. Boosting density estimation. In NIPS, 2002.
[27] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313-2335, 2005.