{"title": "Learning Graphical Models with Mercer Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1040, "abstract": null, "full_text": "Learning Graphical Models with Mercer Kernels\n\nFrancis R. Bach\nDivision of Computer Science\nUniversity of California\nBerkeley, CA 94720\nfbach@cs.berkeley.edu\n\nMichael I. Jordan\nComputer Science and Statistics\nUniversity of California\nBerkeley, CA 94720\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present a class of algorithms for learning the structure of graphical\nmodels from data. The algorithms are based on a measure known as\nthe kernel generalized variance (KGV), which essentially allows us to\ntreat all variables on an equal footing as Gaussians in a feature space\nobtained from Mercer kernels. Thus we are able to learn hybrid graphs\ninvolving discrete and continuous variables of arbitrary type. We explore\nthe computational properties of our approach, showing how to use the\nkernel trick to compute the relevant statistics in linear time. We illustrate\nour framework with experiments involving discrete and continuous data.\n\n1 Introduction\n\nGraphical models are a compact and efficient way of representing a joint probability\ndistribution of a set of variables. In recent years, there has been a growing interest in learning\nthe structure of graphical models directly from data, either in the directed case [1, 2, 3, 4]\nor the undirected case [5]. Current algorithms deal reasonably well with models\ninvolving discrete variables or Gaussian variables having only limited interaction with discrete\nneighbors.\n
However, applications to general hybrid graphs and to domains with general\ncontinuous variables are few, and are generally based on discretization.\nIn this paper, we present a general framework that can be applied to any type of variable.\nWe make use of a relationship between kernel-based measures of \u201cgeneralized variance\u201d\nin a feature space, and quantities such as mutual information and pairwise independence in\nthe input space. In particular, suppose that each variable $x_i$ in our domain is mapped into a\nhigh-dimensional space $\\mathcal{F}_i$ via a map $\\Phi_i$. Let $\\phi_i = \\Phi_i(x_i)$ and consider the set of random\nvariables $\\{\\phi_i\\}$ in feature space. Suppose that we compute the mean and covariance matrix\nof these variables and consider a set of Gaussian variables, $\\{\\phi_i^G\\}$, that have the same mean\nand covariance. We showed in [6] that a canonical correlation analysis of $\\{\\phi_i^G\\}$ yields a\nmeasure, known as \u201ckernel generalized variance,\u201d that characterizes pairwise independence\namong the original variables $\\{x_i\\}$, and is closely related to the mutual information among\nthe original variables. This link led to a new set of algorithms for independent component\nanalysis. In the current paper we pursue this idea in a different direction, considering the\nuse of the kernel generalized variance as a surrogate for the mutual information in model\nselection problems. Effectively, we map data into a feature space via a set of Mercer\nkernels, with different kernels for different data types, and treat all data on an equal footing\n
as Gaussian in feature space.\nWe briefly review the structure-learning problem in Section 2, and in Section 4 and\nSection 5 we show how classical approaches to the problem, based on MDL/BIC and\nconditional independence tests, can be extended to our kernel-based approach. In Section 3 we\nshow that by making use of the \u201ckernel trick\u201d we are able to compute the sample\ncovariance matrix in feature space in linear time in the number of samples. Section 6 presents\nexperimental results.\n\n2 Learning graphical models\n\nStructure learning algorithms generally use one of two equivalent interpretations of\ngraphical models [7]: the compact factorization of the joint probability distribution function leads\nto local search algorithms while conditional independence relationships suggest methods\nbased on conditional independence tests.\nLocal search. In this approach, structure learning is explicitly cast as a model selection\nproblem. For directed graphical models, in the MDL/BIC setting of [2], the likelihood is\npenalized by a model selection term that is equal to $\\frac{1}{2} \\log N$ (with $N$ the number of samples) times the number of\nparameters necessary to encode the local distributions. The likelihood term can be decomposed\nand expressed as follows: $J(G) = \\sum_i J(i, \\pi_i(G))$, with $J(i, \\pi_i(G)) = -N\\, I(x_i, x_{\\pi_i(G)})$,\nwhere $\\pi_i(G)$ is the set of parents of node $i$ in the graph $G$ to be scored and $I(x_i, x_{\\pi_i(G)})$ is the\nempirical mutual information between the variable $x_i$ and the vector $x_{\\pi_i(G)}$. These mutual\ninformation terms and the number of parameters for each local conditional distribution are\neasily computable in discrete models, as well as in Gaussian models. Alternatively, in a full\nBayesian framework, under assumptions about parameter independence, parameter\nmodularity, and prior distributions (Dirichlet for discrete networks, inverse Wishart for Gaussian\nnetworks), the log-posterior probability of a graph given the data can be decomposed in a\nsimilar way [1, 3].\nGiven that our approach is based on the assumption of Gaussianity in feature space, we\ncould base our development on either the MDL/BIC approach or the full Bayesian\napproach. In this paper, we extend the MDL/BIC approach, as detailed in Section 4.\nConditional independence tests. In this approach, conditional independence tests are\nperformed to constrain the structure of possible graphs. For undirected models, going\nfrom the graph to the set of conditional independences is relatively easy: there is an edge\nbetween $x_i$ and $x_j$ if and only if $x_i$ and $x_j$ are not independent given all other variables [7].\nIn Section 5, we show how our approach could be used to perform independence tests and\nlearn an undirected graphical model. We also show how this approach can be used to prune\nthe search space for the local search of a directed model.\n\n3 Gaussians in feature space\n\nIn this section, we introduce our Gaussianity assumption and show how to approximate the\nmutual information, as required for the structure learning algorithms.\n\n3.1 Mercer Kernels\n\nA Mercer kernel on a space $\\mathcal{X}$ is a function $k(x, y)$ from $\\mathcal{X}^2$ to $\\mathbb{R}$ such that for any set of\npoints $\\{x^1, \\ldots, x^N\\}$ in $\\mathcal{X}$, the $N \\times N$ matrix $K$, defined by $K_{ij} = k(x^i, x^j)$, is positive\nsemidefinite. The matrix $K$ is usually referred to as the Gram matrix of the points $\\{x^i\\}$.\nGiven a Mercer kernel $k(x, y)$, it is possible to find a space $\\mathcal{F}$ and a map $\\Phi$ from $\\mathcal{X}$ to $\\mathcal{F}$,\nsuch that $k(x, y)$ is the dot product in $\\mathcal{F}$ between $\\Phi(x)$ and $\\Phi(y)$ (see, e.g., [8]). The space $\\mathcal{F}$\nis usually referred to as the feature space and the map $\\Phi$ as the feature map. We will use\nthe notation $f^\\top g$ to denote the dot product of $f$ and $g$ in feature space $\\mathcal{F}$. We also use the\nnotation $f^\\top$ to denote the representative of $f$ in the dual space of $\\mathcal{F}$.\nFor a discrete variable which takes values in $\\{1, \\ldots, d\\}$, we use the trivial kernel\n$k(x, y) = \\delta_{x=y}$, which corresponds to a feature space of dimension $d$. The feature map is\n$\\Phi(x) = (\\delta_{x=1}, \\ldots, \\delta_{x=d})$. Note that this mapping corresponds to the usual embedding of a\nmultinomial variable of order $d$ in the vector space $\\mathbb{R}^d$.\nFor continuous variables, we use the Gaussian kernel $k(x, y) = \\exp(-(x-y)^2 / 2\\sigma^2)$. The feature\nspace has infinite dimension, but as we will show, the data only occupy a small linear\nmanifold and this linear subspace can be determined adaptively in linear time.\n
Note that an alternative is to use the linear kernel $k(x, y) = xy$, which corresponds to simply modeling the\ndata as Gaussian in input space.\n\n3.2 Notation\n\nLet $x_1, \\ldots, x_m$ be $m$ random variables with values in spaces $\\mathcal{X}_1, \\ldots, \\mathcal{X}_m$. Let us assign\na Mercer kernel $k_i$ to each of the input spaces $\\mathcal{X}_i$, with feature space $\\mathcal{F}_i$ and feature map\n$\\Phi_i$. The random vector of feature images $\\phi = (\\phi_1, \\ldots, \\phi_m) = (\\Phi_1(x_1), \\ldots, \\Phi_m(x_m))$\nhas a covariance matrix $C$ defined by blocks, with block $C_{ij}$ being the covariance matrix\nbetween $\\phi_i = \\Phi_i(x_i)$ and $\\phi_j = \\Phi_j(x_j)$. Let $\\phi^G = (\\phi_1^G, \\ldots, \\phi_m^G)$ denote a jointly Gaussian\nvector with the same mean and covariance as $\\phi = (\\phi_1, \\ldots, \\phi_m)$. The vector $\\phi^G$ will be\nused as the random vector on which the learning of graphical model structure is based.\nNote that the sufficient statistics for this vector are $\\{\\Phi_i(x_i), \\Phi_i(x_i) \\Phi_j(x_j)^\\top\\}$, and are\ninherently pairwise. No dependency involving strictly more than two variables is modeled\nexplicitly, which makes our scoring metric easy to compute. In Section 6, we present\nempirical evidence that good models can be learned using only pairwise information.\n\n3.3 Computing sample covariances using kernel trick\n\nWe are given a random sample $\\{x^1, \\ldots, x^N\\}$ of elements of $\\mathcal{X}_1 \\times \\cdots \\times \\mathcal{X}_m$. By mapping\ninto the feature spaces, we define $Nm$ elements $\\phi_i^k = \\Phi_i(x_i^k)$. We assume that for each $i$\nthe data in feature space $\\{\\phi_i^1, \\ldots, \\phi_i^N\\}$ have been centered, i.e., $\\sum_{k=1}^N \\phi_i^k = 0$. The sample\ncovariance matrix $\\hat{C}_{ij}$ is then equal to $\\hat{C}_{ij} = \\frac{1}{N} \\sum_{k=1}^N \\phi_i^k (\\phi_j^k)^\\top$.\nNote that a Gaussian with covariance matrix $\\hat{C}$ has zero variance along directions that are orthogonal to the images\nof the data. Consequently, in order to compute the mutual information, we only need to\ncompute the covariance matrix of the projection of $\\phi^G$ onto the linear span of the data, that\nis, for all $i, j, k, l$:\n\n$$(\\phi_i^k)^\\top \\hat{C}_{ij} \\phi_j^l = \\frac{1}{N} \\sum_{n=1}^N (\\phi_i^k)^\\top \\phi_i^n (\\phi_j^n)^\\top \\phi_j^l = \\frac{1}{N} e_k^\\top K_i K_j e_l, \\qquad (1)$$\n\nwhere $e_k$ denotes the $N \\times 1$ vector with only zeros except at position $k$, and $K_i$ is the Gram\nmatrix of the centered points, the so-called centered Gram matrix of the $i$-th component,\ndefined from the Gram matrix $L_i$ of the original (non-centered) points as\n$K_i = (I - \\frac{1}{N} \\mathbf{1}) L_i (I - \\frac{1}{N} \\mathbf{1})$, where $\\mathbf{1}$ is an $N \\times N$ matrix composed of ones [8]. From Eq. (1), we\nsee that the sample covariance matrix of $\\phi^G$ in the \u201cdata basis\u201d has blocks $\\frac{1}{N} K_i K_j$.\n\n3.4 Regularization\n\nWhen the feature space has infinite dimension (as in the case of a Gaussian kernel on $\\mathbb{R}$),\nthen the covariance we are implicitly fitting with a kernel method has an infinite number\nof parameters. In order to avoid overfitting and control the capacity of our models, we\nregularize by smoothing the Gaussian $\\phi^G$ by another Gaussian with small variance (for\nan alternative interpretation and further details, see [6]). Let $\\kappa$ be a small constant. We\nadd to $\\phi^G$ an isotropic Gaussian with covariance $\\kappa I$ in an orthonormal basis. In the data\nbasis, the covariance of this Gaussian is exactly the block diagonal matrix with blocks\n$\\kappa K_i$.\n
Consequently, our regularized Gaussian covariance $\\tilde{C}$ has blocks $\\tilde{C}_{ij} = \\frac{1}{N} K_i K_j$\nif $i \\neq j$ and $\\tilde{C}_{ii} = \\frac{1}{N} K_i^2 + \\kappa K_i$. Since $\\kappa$ is a small constant, we can use\n$\\tilde{C}_{ii} = \\frac{1}{N} (K_i + \\frac{N\\kappa}{2} I)^2$, which leads to a more compact correlation matrix $R$, with blocks\n$R_{ij} = r_i r_j$ for $i \\neq j$, and $R_{ii} = I$, where $r_i = K_i (K_i + \\frac{N\\kappa}{2} I)^{-1}$.\nThese cross-correlation matrices have exact dimension $N$, but since the eigenvalues of\n$r_i$ are softly thresholded to zero or one by the regularization, the effective dimension is\n$d_i = \\mathrm{tr}\\, K_i (K_i + \\frac{N\\kappa}{2} I)^{-1}$. This dimensionality $d_i$ will be used as the dimension of our\nGaussian variables for the MDL/BIC criterion, in Section 4.\n\n3.5 Efficient implementation\n\nDirect manipulation of $N \\times N$ matrices would lead to algorithms that scale as $O(N^3)$.\nGram matrices, however, are known to be well approximated by matrices of low rank $M$.\nThe approximation is exact when the feature space has finite dimension $d$ (e.g., with\ndiscrete kernels), and $M$ can then be chosen no larger than $d$. In the case of continuous data with the\nGaussian kernel, we have shown that $M$ can be chosen to be upper bounded by a constant\nindependent of $N$ [6]. Finding a low-rank decomposition can thus be done through\nincomplete Cholesky decomposition in linear time in $N$ (for a detailed treatment of this issue,\nsee [6]).\nUsing the incomplete Cholesky decomposition, for each matrix $K_i$ we obtain the\nfactorization $K_i \\approx G_i G_i^\\top$, where $G_i$ is an $N \\times M_i$ matrix with rank $M_i$, where $M_i \\ll N$. We\nperform a singular value decomposition of $G_i$ to obtain an $N \\times M_i$ matrix $U_i$ with\northogonal columns (i.e., such that $U_i^\\top U_i = I$), and an $M_i \\times M_i$ diagonal matrix $\\Lambda_i$ such that\n$K_i \\approx G_i G_i^\\top = U_i \\Lambda_i U_i^\\top$.\nWe have $r_i = K_i (K_i + \\frac{N\\kappa}{2} I)^{-1} = U_i D_i U_i^\\top$, where $D_i$ is the diagonal matrix\nobtained from the diagonal matrix $\\Lambda_i$ by applying the function $\\lambda \\mapsto \\lambda / (\\lambda + N\\kappa/2)$ to its\nelements. Thus $\\phi^G$ has a correlation matrix with blocks $R_{ij} = D_i U_i^\\top U_j D_j$ in the new\nbasis defined by the columns of the matrices $U_i$, and these blocks will be used to compute\nthe various mutual information terms.\n\n3.6 KGV-mutual information\n\nWe now show how to compute the mutual information between $\\phi_1^G, \\ldots, \\phi_m^G$, and we make\na link with the mutual information of the original variables $x_1, \\ldots, x_m$.\nLet $y_1, \\ldots, y_m$ be $m$ jointly Gaussian random vectors with covariance matrix $\\Sigma$, defined\nin terms of blocks $\\Sigma_{ij} = \\mathrm{cov}(y_i, y_j)$. The mutual information between the variables\n$y_1, \\ldots, y_m$ is equal to (see, e.g., [9]):\n\n$$I(y_1, \\ldots, y_m) = -\\frac{1}{2} \\log \\frac{|\\Sigma|}{|\\Sigma_{11}| \\cdots |\\Sigma_{mm}|}, \\qquad (2)$$\n\nwhere $|A|$ denotes the determinant of the matrix $A$. The ratio of determinants in this\nexpression is usually referred to as the generalized variance, and is independent of the basis\nwhich is chosen to compute $\\Sigma$.\nFollowing Eq.\n
(2), the mutual information between $\\phi_1^G, \\ldots, \\phi_m^G$, which depends solely on\nthe distribution of $x$, is equal to\n\n$$I^K(x_1, \\ldots, x_m) = -\\frac{1}{2} \\log \\frac{|R|}{|R_{11}| \\cdots |R_{mm}|}. \\qquad (3)$$\n\nWe refer to this quantity as the KGV-mutual information (KGV stands for kernel\ngeneralized variance). It is always nonnegative and can also be defined for partitions of the\nvariables into subsets, by simply partitioning the correlation matrix $R$ accordingly.\nThe KGV has an interesting relationship to the mutual information among the original\nvariables, $x_1, \\ldots, x_m$. In particular, as shown in [6], in the case of two discrete variables,\nthe KGV is equal to the mutual information up to second order, when expanding around the\nmanifold of distributions that factorize in the trivial graphical model (i.e., with independent\ncomponents). Moreover, in the case of continuous variables, when the width $\\sigma$ of the\nGaussian kernel tends to zero, the KGV necessarily tends to a limit, and also provides a\nsecond-order expansion of the mutual information around independence.\nThis suggests that the KGV-mutual information might also provide a useful,\ncomputationally-tractable surrogate for the mutual information more generally, and in\nparticular substitute for mutual information terms in objective functions for model selection,\nwhere even a rough approximation might suffice to rank models. In the remainder of the\npaper, we investigate this possibility empirically.\n\n4 Structure learning using local search\n\nIn this approach, an objective function $J(G)$ measures the goodness of fit of the\ndirected graphical model $G$, and is minimized. The MDL/BIC objective function for our\nGaussian variables is easily derived. Let $\\pi_i(G)$ be the set of parents of node $i$ in $G$.\nWe have $J(G) = \\sum_i J(i, \\pi_i(G))$, with\n\n$$J(i, \\pi_i(G)) = -N\\, I^K(x_i, x_{\\pi_i(G)}) + \\frac{\\log N}{2}\\, d_i\\, d_{\\pi_i(G)}, \\qquad (4)$$\n\nwhere $d_{\\pi_i(G)} = \\sum_{j \\in \\pi_i(G)} d_j$. Given the scoring metric $J(G)$, we are faced with an NP-hard\noptimization problem on the space of directed acyclic graphs [10].\n
Because the score\ndecomposes as a sum of local scores, local greedy search heuristics are usually exploited.\nWe adopt such heuristics in our simulations, using hillclimbing. It is also possible to use\nMarkov-chain Monte Carlo (MCMC) techniques to sample from the posterior distribution\ndefined by $P(G | D) \\propto \\exp(-J(G))$ within our framework; this would in principle allow\nus to output several high-scoring networks.\n\n5 Conditional independence tests using KGV\n\nIn this section, we indicate how conditional independence tests can be performed using the\nKGV, and show how these tests can be used to estimate Markov blankets of nodes.\nLikelihood ratio criterion. In the case of marginal independence, the likelihood ratio\ncriterion is exactly equal to a power of the mutual information (see, e.g., [11] in the case\nof Gaussian variables). This generalizes easily to conditional independence, where the\nlikelihood ratio criterion to test the conditional independence of $y$ and $z$ given $x$ is equal to\n$\\exp(-N [I(x, y, z) - I(x, y) - I(x, z)])$, where $N$ is the number of samples and the mutual\ninformation terms are computed using empirical distributions.\nApplied to our Gaussian variables $\\phi^G$, we obtain a test statistic based on a linear combination\nof KGV-mutual information terms: $I^K(x, y, z) - I^K(x, y) - I^K(x, z)$. Theoretical threshold\nvalues exist for conditional independence tests with Gaussian variables [7], but instead,\nwe prefer to use the value given by the MDL/BIC criterion, i.e., $\\frac{\\log N}{2N} d_y d_z$ (where $d_y$ and\n$d_z$ are the dimensions of the Gaussians), so that the same decision regarding conditional\nindependence is made in the two approaches (scoring metric or independence tests) [12].\nMarkov blankets. For Gaussian variables, it is well-known that some conditional\nindependencies can be read out from the inverse of the joint covariance matrix [7]. More precisely,\nif $y_1, \\ldots, y_m$ are $m$ jointly Gaussian random vectors with dimensions $d_i$, and with\ncovariance matrix $\\Sigma$ defined in terms of blocks $\\Sigma_{ij} = \\mathrm{cov}(y_i, y_j)$, then $y_i$ and $y_j$ are independent\ngiven all the other variables if and only if the block $(i, j)$ of $\\Sigma^{-1}$ is equal to zero.\nThus in the sample case, we can read out the edges of the undirected model directly from\n$\\Sigma^{-1}$, using a test statistic computed from its blocks together with a threshold value.\nApplied to the variables $\\phi_i^G$ and for all pairs of nodes, we can find an undirected\ngraphical model in polynomial time, and thus a set of Markov blankets [4].\nWe may also be interested in constructing a directed model from the Markov blankets;\nhowever, this transformation is not always possible [7]. Consequently, most approaches\nuse heuristics to define a directed model from a set of conditional independencies [4, 13].\nAlternatively, as a pruning step in learning a directed graphical model, the Markov blanket\ncan be safely used by only considering directed models whose moral graph is covered by\nthe undirected graph.\n\n6 Experiments\n\nWe compare the performance of three hillclimbing algorithms for directed graphical\nmodels, one using the KGV metric (with fixed values of the regularization parameter $\\kappa$ and the\nkernel width $\\sigma$), one using the MDL/BIC metric of [2] and one using the BDe metric of [1]\n(with a fixed equivalent prior sample size).\nWhen the domain includes continuous variables, we used two discretization strategies; the\nfirst one is to use K-means with a given number of clusters, the second one uses the adaptive\ndiscretization scheme for the MDL/BIC scoring metric of [14]. Also, to parameterize the\nlocal conditional probabilities we used mixture models (mixture of Gaussians, mixture\nof softmax regressions, mixture of linear regressions), which provide enough flexibility\nat reasonable cost. These models were fitted using penalized maximum likelihood, and\ninvoking the EM algorithm whenever necessary. The number of mixture components was\nless than four and determined using the minimum description length (MDL) principle.\nWhen the true generating network is known, we measure the performance of algorithms by\nthe KL divergence to the true distribution; otherwise, we report log-likelihood on held-out\ntest data. We use as a baseline the log-likelihood for the maximum likelihood solution to a\nmodel with independent components and multinomial or Gaussian densities as appropriate\n(i.e., for discrete and continuous variables respectively).\n\n
Toy examples. We tested all three algorithms on a very simple generative model on $m$\nbinary nodes, where nodes $1$ through $m-1$ point to node $m$. For each assignment $y$ of the\n$m-1$ parents, we set $P(x_m = 1 | y)$ by sampling uniformly at random from a fixed interval. We also\nstudied a linear Gaussian generative model with the identical topology, with regression\nweights chosen uniformly at random from a fixed interval. We generated a fixed number $N$ of samples.\nWe report average results (over 20 replications) in Figure 1 (left), for $m$ ranging from 1\nto 10. We see that on the discrete networks, the performance of all three algorithms is\nsimilar, degrading slightly as $m$ increases. On the linear networks, on the other hand, the\ndiscretization methods degrade significantly as $m$ increases. The KGV approach is the\nonly approach of the three capable of discovering these simple dependencies in both kinds\nof networks.\nDiscrete networks. We used three networks commonly used as benchmarks (see footnote 1): the ALARM\nnetwork (37 variables), the INSURANCE network (27 variables) and the HAILFINDER\nnetwork (56 variables). We tested various numbers of samples $N$. We performed 40\nreplications and report average results in Figure 1 (right). We see that the performance of our\nmetric lies between the (approximate Bayesian) BIC metric and the (full Bayesian) BDe\nmetric. Thus the performance of the new metric appears to be competitive with standard\nmetrics for discrete data, providing some assurance that even in this case pairwise\nsufficient statistics in feature space seem to provide a reasonable characterization of Bayesian\nnetwork structure.\n\n1 Available at http://www.cs.huji.ac.il/labs/compbio/Repository/.\n\n[Figure 1, left panels: KL divergence (vertical axis) vs. network size $m$ (horizontal axis).]\n\nNetwork     N (x 10^3)  BIC\nALARM       0.5         0.85\nALARM       1           0.42\nALARM       4           0.17\nALARM       16          0.04\nINSURANCE   0.5         1.84\nINSURANCE   1           0.93\nINSURANCE   4           0.27\nINSURANCE   16          0.05\nHAILFINDER  0.5         2.98\nHAILFINDER  1           1.70\nHAILFINDER  4           0.63\nHAILFINDER  16          0.25\n\nBDe KGV\n0.66\n0.47\n0.39\n0.25\n0.15\n0.07\n0.06\n0.02\n0.92\n1.53\n0.83\n0.52\n0.40\n0.15\n0.19\n0.04\n2.99\n2.29\n1.32\n1.77\n0.63\n0.48\n0.17\n0.32\n\nFigure 1: (Top left) KL divergence vs. size of discrete network $m$: KGV (plain), BDe\n(dashed), MDL/BIC (dotted). (Bottom left) KL divergence vs. size of linear Gaussian\nnetwork: KGV (plain), BDe with discretized data (dashed), MDL/BIC with discretized\ndata (dotted x), MDL/BIC with adaptive discretization (dotted +). (Right) KL divergence\nfor discrete network benchmarks.\n\nHybrid networks. It is the case of hybrid discrete/continuous networks that is our principal\ninterest: in this case the KGV metric can be applied directly, without discretization of the\ncontinuous variables. We investigated performance on several hybrid datasets from the\nUCI machine learning repository, dividing them into two subsets, 4/5 for training and 1/5\nfor testing. We also log-transformed all continuous variables that represent rates or counts.\nWe report average results (over 10 replications) in Table 1 for the KGV metric and for the\nBDe metric; continuous variables are discretized using K-means with 5 clusters (d-5) or\n10 clusters (d-10). We see that although the BDe methods perform well in some problems,\ntheir performance overall is not as consistent as that of the KGV metric.\n\nNetwork     N     D  C   d-5    d-10   KGV\nABALONE     4175  1  8   10.68  10.53  11.16\nVEHICLE     846   1  18  21.92  21.12  22.71\nPIMA        768   1  8   3.18   3.14   3.30\nAUSTRALIAN  690   9  6   5.26   5.11   5.40\nBREAST      683   1  10  15.00  15.03  15.04\nBALANCE     625   1  4   1.97   2.03   1.88\nHOUSING     506   1  13  14.71  14.25  14.16\nCARS1       392   1  7   6.93   6.58   6.85\nCLEVE       296   8  6   2.66   2.57   2.68\nHEART       270   9  5   1.34   1.36   1.32\n\nTable 1: Performance for hybrid networks. $N$ is the number of samples, and $D$ and $C$\nare the number of discrete and continuous variables, respectively. The best performance in\neach row is indicated in bold font.\n\n7 Conclusion\n\nWe have presented a general method for learning the structure of graphical models, based\non treating variables as Gaussians in a high-dimensional feature space.\n
The method seamlessly\nintegrates discrete and continuous variables in a unified framework, and can provide\nimprovements in performance when compared to approaches based on discretization of\ncontinuous variables.\nThe method also has appealing computational properties; in particular, the Gaussianity\nassumption enables us to make only a single pass over the data in order to compute the pairwise\nsufficient statistics. The Gaussianity assumption also provides a direct way to\napproximate Markov blankets for undirected graphical models, based on the classical link between\nconditional independence and zeros in the precision matrix.\nWhile the use of the KGV as a scoring metric is inspired by the relationship between the\nKGV and the mutual information, it must be emphasized that this relationship is a local one,\nbased on an expansion of the mutual information around independence. While our\nempirical results suggest that the KGV is also an effective surrogate for the mutual information\nmore generally, further theoretical work is needed to provide a deeper understanding of the\nKGV in models that are far from independence.\nFinally, our algorithms have free parameters, in particular the regularization parameter and\nthe width of the Gaussian kernel for continuous variables. Although the performance is\nempirically robust to the setting of these parameters, learning those parameters from data\nwould not only provide better and more consistent performance, but it would also provide\na principled way to learn graphical models with local structure [15].\n\nAcknowledgments\n\nThe simulations were performed using Kevin Murphy\u2019s Bayes Net Toolbox for MATLAB.\nWe would like to acknowledge support from NSF grant IIS-9988642, ONR MURI\nN00014-00-1-0637 and a grant from Intel Corporation.\n\nReferences\n\n[1] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197\u2013243, 1995.\n[2] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(4):269\u2013293, 1994.\n[3] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proc. UAI, 1994.\n[4] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.\n[5] S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE Trans. PAMI, 19(4):380\u2013393, 1997.\n[6] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1\u201348, 2002.\n[7] S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.\n[8] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.\n[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley & Sons, 1991.\n[10] D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics 5. Springer-Verlag, 1996.\n[11] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley & Sons, 1984.\n[12] R. G. Cowell. Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. In Proc. UAI, 2001.\n[13] D. Margaritis and S. Thrun. Bayesian network induction via local neighborhoods. In Adv. NIPS 12, 2000.\n[14] N. Friedman and M. Goldszmidt. Discretizing continuous attributes while learning Bayesian networks. In Proc. ICML, 1996.\n[15] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Learning in Graphical Models. MIT Press, 1998.\n", "award": [], "sourceid": 2293, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}