{"title": "Copula Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 559, "page_last": 567, "abstract": "We present the Copula Bayesian Network model for representing multivariate continuous distributions. Our approach builds on a novel copula-based parameterization of a conditional density that, joined with a graph that encodes independencies, offers great flexibility in modeling high-dimensional densities, while maintaining control over the form of the univariate marginals. We demonstrate the advantage of our framework for generalization over standard Bayesian networks as well as tree structured copula models for varied real-life domains that are of substantially higher dimension than those typically considered in the copula literature.", "full_text": "Copula Bayesian Networks\n\nGal Elidan\n\nDepartment of Statistics\n\nHebrew University\n\nJerusalem, 91905, Israel\n\ngalel@huji.ac.il\n\nAbstract\n\nWe present the Copula Bayesian Network model for representing multivariate\ncontinuous distributions, while taking advantage of the relative ease of estimat-\ning univariate distributions. Using a novel copula-based reparameterization of a\nconditional density, joined with a graph that encodes independencies, our model\noffers great \ufb02exibility in modeling high-dimensional densities, while maintaining\ncontrol over the form of the univariate marginals. We demonstrate the advantage\nof our framework for generalization over standard Bayesian networks as well as\ntree structured copula models for varied real-life domains that are of substantially\nhigher dimension than those typically considered in the copula literature.\n\nIntroduction\n\n1\nMultivariate real-valued distributions are of paramount importance in a variety of \ufb01elds ranging from\ncomputational biology and neuro-science to economics to climatology. 
Choosing and estimating a\nuseful form for the marginal distribution of each variable in the domain is often a straightforward\ntask. In contrast, aside from the normal representation, few univariate distributions have a conve-\nnient multivariate generalization. Indeed, modeling and estimation of \ufb02exible (skewed, multi-modal,\nheavy tailed) high-dimensional distributions is still a formidable challenge.\nCopulas [23] offer a general framework for constructing multivariate distributions using any given\n(or estimated) univariate marginals and a copula function C that links these marginals. The impor-\ntance of copulas is rooted in Sklar\u2019s theorem [29] that states that any multivariate distribution can\nbe represented as a copula function of its marginals. The constructive converse is important from a\nmodeling perspective as it allows us to separate the choice of the marginals and that of the depen-\ndence structure which is expressed in C. We can, for example, robustly estimate marginals using\na non-parametric approach, and then use only few parameters to capture the dependence structure.\nThis can result in a model that is easier to estimate and less prone to over-\ufb01tting than a fully non-\nparametric one, while at the same time avoiding the limitations of a fully parameterized distribution.\nIn practice, copula constructions often lead to signi\ufb01cant improvement in density estimation. Ac-\ncordingly, there has been a dramatic growth of academic and practical interest in copulas in recent\nyears, with applications ranging from mainstream \ufb01nancial risk assessment and actuarial analysis\n(e.g., Embrechts et al. [7]) to off-shore engineering (e.g., Accioly and Chiyoshi [2]).\nDespite the generality of the framework, constructing high-dimensional copulas is dif\ufb01cult, and\nmuch of the research involves only the bivariate case. 
Several works have attempted to overcome this difficulty by suggesting innovative ways in which bivariate copulas can be combined to form workable copulas of higher dimensions. These attempts, however, are either limited to hierarchical [26] or mixture of trees [14] compositions, or rely on a recursive construction of conditional bivariate copulas [1, 3, 17] that is somewhat elaborate for high dimensions. In practice, applications are almost always limited to a modest (< 10) number of variables (see Section 6 for further discussion).

Bayesian networks (BNs) [25] offer a markedly different approach for representing multivariate distributions. In this widely used framework, a graph structure encodes independencies which imply a decomposition of the joint density into local terms (the density of each variable conditioned on its parents). This decomposition in turn facilitates efficient probabilistic computation and estimation, making the framework amenable to high-dimensional domains. However, the expressiveness of these models is hampered by practical considerations that almost always lead to the reliance on simple parametric forms. Specifically, non-parametric variants of BNs (e.g., [9, 27]) typically involve elaborate training setups with a running time that grows unfavorably with the number of samples and local graph connectivity. Furthermore, aside from the case of the normal distribution, the form of the univariate marginal is neither under control nor is it typically known.

Our goal is to construct flexible multivariate continuous distributions that maintain desired marginals while accommodating tens and hundreds of variables, or more. We present Copula Bayesian Networks (CBNs), an elegant marriage between the copula and the Bayesian network frameworks.1 As in BNs, we make use of a graph to encode independencies that are assumed to hold.
Differently,\nwe rely on local copula functions and an explicit globally shared parameterization of the univariate\ndensities. This allows us to retain the \ufb02exibility of BNs, while offering control over the form of the\nmarginals, resulting in substantially improved multivariate densities (see Section 7 for a discussion\nof the related works of Kirshner [14] and Liu et al. [20]).\nAt the heart of our approach is a novel reparameterization of a conditional density using a copula\nquotient. With this construction, we prove a parallel to the BN factorization theorem: a decomposi-\ntion of the joint density according to the structure of the graph implies a decomposition of the joint\ncopula. Conversely, a product of local copula-based quotient terms is a valid multivariate copula.\nThis result provides us with a \ufb02exible modeling tool where joint densities are constructed via a com-\nposition of local copulas and marginal densities. Importantly, the construction also allows us to use\nstandard BN machinery for estimation and structure learning. Thus, our model opens the door for\n\ufb02exible explorative learning of high-dimensional models that retain desired marginal characteristics.\nWe learn the structure and parameters of a CBN for three varied real-life domains that are of a\nsigni\ufb01cantly higher dimension than typically reported in the copula literature. Using standard copula\nfunctions, we show that in all cases our approach leads to consistent and signi\ufb01cant improvement in\ngeneralization when compared to standard BN models as well as a tree-structured copula model.\n\n2 Copulas\nLet X = {X1, . . . , XN} be a \ufb01nite set of real-valued random variables and let FX (x) \u2261 P (X1 \u2264\nx1, . . . , Xn \u2264 xN ) be a (cumulative) distribution function over X , with lower case letters denoting\nassignment to variables. 
By slight abuse of notation, we use F(x_i) \equiv F(X_i \leq x_i, X_{X/X_i} = \infty) and f(x_i) \equiv f_{X_i}(x_i), and similarly for sets of variables f(y) \equiv f_Y(y). A copula function [23, 29] links marginal distributions to form a multivariate one. Formally,

Definition 2.1: Let U_1, \ldots, U_N be real random variables marginally uniformly distributed on [0, 1]. A copula function C : [0, 1]^N \to [0, 1] is a joint distribution function

C(u_1, \ldots, u_N) = P(U_1 \leq u_1, \ldots, U_N \leq u_N).

Copulas are important because of the following seminal result:

Theorem 2.2: [Sklar 1959] Let F(x_1, \ldots, x_N) be any multivariate distribution over real-valued random variables, then there exists a copula function such that

F(x_1, \ldots, x_N) = C(F(x_1), \ldots, F(x_N)).

Furthermore, if each F(x_i) is continuous then C is unique.

The constructive converse which is of central interest from a modeling perspective is also true: since for any random variable the cumulative distribution F(x_i) is uniformly distributed on [0, 1], any copula function taking the marginal distributions {F(x_i)} as its arguments defines a valid joint distribution with marginals F(x_i). Thus, copulas are "distribution-generating" functions that allow us to separate the choice of the univariate marginals and that of the dependence structure expressed in the copula function C, often resulting in an effective real-valued construction.2

1 A preliminary draft of this paper appeared as a technical report. A companion paper [6] addresses the question of performing approximate inference in Copula Bayesian networks.

2 Copulas can also be defined given non-continuous marginals and for ordinal random variables.
These extensions are orthogonal to our work and to maintain clarity we focus here on the continuous case.

Figure 1: Samples from the 2-dimensional normal copula density using a correlation matrix with a unit diagonal and an off-diagonal coefficient of 0.25. (left) with zero mean and unit variance normal marginals; (right) with a mixture of two Gaussians marginals. Panel titles: Normal(1, 1) marginals; Mix of Gaussians marginals.

To derive the joint density f(x) = \frac{\partial^N F(x)}{\partial x_1 \ldots \partial x_N} from the copula construction, assuming F has N-order partial derivatives (true almost everywhere when F is continuous), and using the chain rule, we have

f(x) = \frac{\partial^N C(F(x_1), \ldots, F(x_N))}{\partial F(x_1) \ldots \partial F(x_N)} \prod_i f(x_i) = c(F(x_1), \ldots, F(x_N)) \prod_i f(x_i),    (1)

where c(F(x_1), \ldots, F(x_N)) is called the copula density function. Eq. (1) will be of central use in this paper as we will directly model joint densities.

Example 2.3: A simple copula widely explored in the financial community is the Gaussian copula constructed directly by inverting Sklar's theorem [7]

C({F(x_i)}) = \Phi_\Sigma\left(\Phi^{-1}(F(x_1)), \ldots, \Phi^{-1}(F(x_N))\right),    (2)

where \Phi is the standard normal distribution and \Phi_\Sigma is the zero mean normal distribution with correlation matrix \Sigma. To get a sense of the power of copulas, Figure 1 shows samples generated from this copula using two different families of univariate marginals. More generally and without added computational difficulty, we can also mix and match marginals of different forms.

3 Copula Bayesian Networks (CBNs)
As in the copula framework, our goal is to model real-valued multivariate distributions while taking advantage of the relative ease of one dimensional estimation.
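The construction of Example 2.3 and the sampling behind Figure 1 can be sketched in a few lines (a minimal illustration, not the paper's code; the N(0,1)/N(4,1) mixture marginal and the bisection-based CDF inversion are our own illustrative choices):

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mixture_cdf(x):
    """CDF of an equal mixture of N(0,1) and N(4,1) (an illustrative marginal)."""
    return 0.5 * norm_cdf(x) + 0.5 * norm_cdf(x - 4.0)

def inverse_cdf(cdf, u, lo=-50.0, hi=50.0, iters=80):
    """Invert a continuous, increasing CDF by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_gaussian_copula(rho, inv_marginals, n, rng):
    """Sample pairs whose dependence is a bivariate Gaussian copula with
    correlation rho, pushed through arbitrary inverse marginal CDFs."""
    out = []
    for _ in range(n):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1, z2 = g1, rho * g1 + math.sqrt(1.0 - rho * rho) * g2  # correlated normals
        u1, u2 = norm_cdf(z1), norm_cdf(z2)                      # uniform margins
        out.append((inv_marginals[0](u1), inv_marginals[1](u2)))  # chosen marginals
    return out

rng = random.Random(0)
inv_normal = lambda u: inverse_cdf(norm_cdf, u)
inv_mixture = lambda u: inverse_cdf(mixture_cdf, u)
data = sample_gaussian_copula(0.25, (inv_normal, inv_mixture), 2000, rng)
```

Swapping `inv_mixture` for any other inverse CDF changes the marginals while leaving the dependence structure untouched, which is exactly the separation the copula framework provides.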
To cope with high-dimensional domains, as in BNs, we would also like to utilize independence assumptions encoded by a graph. To achieve this goal, we will construct multivariate copulas that are a composition of local copulas that follow the structure of the graph. We start with the building block of our construction.

3.1 Copula Parameterization of The Conditional Density
As in the BN framework, the building block of our model will be a local conditional density. We start with a parameterization of such a density using copulas:

Lemma 3.1: Let f(x | y), with y = {y_1, \ldots, y_K}, be a conditional density function and let f(x) be the marginal density of X. Then there exists a copula density function c(F(x), F(y_1), \ldots, F(y_K)) such that

f(x | y) = R_c(F(x), F(y_1), \ldots, F(y_K)) f(x)

where R_c is the ratio

R_c(F(x), F(y_1), \ldots, F(y_K)) \equiv \frac{c(F(x), F(y_1), \ldots, F(y_K))}{\int c(F(x), F(y_1), \ldots, F(y_K)) f(x) dx} = \frac{c(F(x), F(y_1), \ldots, F(y_K))}{\partial^K C(1, F(y_1), \ldots, F(y_K)) / \partial F(y_1) \ldots \partial F(y_K)},

and where R_c is defined to be 1 when y = \emptyset. The converse is also true, for any copula density function c, R_c(F(x), F(y_1), \ldots, F(y_K)) f(x) defines a valid conditional density function.

Before proving this result, it is important to understand why the derivative form of the denominator (right-most term) is more useful than the standard normalization integral \int c(F(x), F(y_1), \ldots, F(y_K)) f(x) dx. Recall that c() is itself an N-order derivative of the copula function so computing our denominator is no more difficult than computing c(). Indeed, for the majority of existing copula functions, both have an explicit form.
In contrast, the integral term depends both on the copula form and the univariate marginal, and is generally difficult to compute.

Proof: From the basic properties of cumulative distribution functions, we have that for any copula function C(1, F(y_1), \ldots, F(y_K)) = F(y_1, \ldots, y_K) and thus, using the derivative chain rule,

f(y) = \frac{\partial^K C(1, F(y_1), \ldots, F(y_K))}{\partial y_1 \ldots \partial y_K} = \frac{\partial^K C(1, F(y_1), \ldots, F(y_K))}{\partial F(y_1) \ldots \partial F(y_K)} \prod_k f(y_k).

From Eq. (1) we have that there exists a copula density for which f(x, y_1, \ldots, y_K) = c(F(x), F(y_1), \ldots, F(y_K)) f(x) \prod_k f(y_k). It follows that

f(x | y) = \frac{f(x, y_1, \ldots, y_K)}{f(y)} = \frac{c(F(x), F(y_1), \ldots, F(y_K)) f(x) \prod_k f(y_k)}{\frac{\partial^K C(1, F(y_1), \ldots, F(y_K))}{\partial F(y_1) \ldots \partial F(y_K)} \prod_k f(y_k)} = \frac{c(F(x), F(y_1), \ldots, F(y_K)) f(x)}{\frac{\partial^K C(1, F(y_1), \ldots, F(y_K))}{\partial F(y_1) \ldots \partial F(y_K)}} \equiv R_c(F(x), F(y_1), \ldots, F(y_K)) f(x).

As in Sklar's theorem and Eq. (1), the converse follows easily by reversing the arguments.
The implications of this result will underlie our construction: any copula density function c(x, y_1, \ldots, y_K), together with f(x), can be used to parameterize a conditional density f(x | y).

3.2 Decomposition of The Joint Copula
Let G be a directed acyclic graph whose nodes correspond to the random variables X, and let Pa_i = {Pa_{i1}, \ldots, Pa_{ik_i}} be the parents of X_i in G. G encodes the independence statements I(G) = {(X_i \perp NonDescendants_i | Pa_i)}, where NonDescendants_i are nodes that are non-descendants of X_i in G. We say that f_X(x) decomposes according to G if it can be written as a product of conditional densities f_X(x) = \prod_i f(X_i | Pa_i).
It can be shown that if f decomposes according to G then I(G) hold in f_X(x). The converse is also true: if I(G) hold in f_X(x) then the density decomposes according to G (see [16], theorems 3.1 and 3.2). These results form the basis for the BN model [25] where a joint density is constructed via a composition of local conditional densities.
We now show that similar results hold for a multivariate copula. This in turn will provide the basis for our construction of the CBN model.

Theorem 3.2: Decomposition. Let G be a directed acyclic graph over X, and let f_X(x) be parameterized via a joint copula density f_X(x) = c(F(x_1), \ldots, F(x_N)) \prod_i f(x_i), with f_X(x) strictly positive for all values of X. If f_X(x) decomposes according to G then the copula density c(F(x_1), \ldots, F(x_N)) also decomposes according to G

c(F(x_1), \ldots, F(x_N)) = \prod_i R_{c_i}(F(x_i), \{F(pa_{ik})\}),

where c_i is a local copula that depends only on the value of X_i and its parents in G.

Proof: Using the positivity assumption, we can rearrange Eq. (1) to get c(F(x_1), \ldots, F(x_N)) = \frac{f(x)}{\prod_i f(x_i)}. From Lemma 3.1 and the decomposition of f(x) we have

c(F(x_1), \ldots, F(x_N)) = \frac{f(x)}{\prod_i f(x_i)} = \frac{\prod_i f(x_i | pa_i)}{\prod_i f(x_i)} = \frac{\prod_i R_{c_i}(F(x_i), \{F(pa_{ik})\}) f(x_i)}{\prod_i f(x_i)} = \prod_i R_{c_i}(F(x_i), \{F(pa_{ik})\}).

The constructive converse that is of central interest here is also true:

Theorem 3.3: Composition. Let G be a directed acyclic graph over X. In addition, let {c_i(F(x_i), F(pa_{i1}), \ldots, F(pa_{ik_i}))} be a set of strictly positive copula densities associated with the nodes of G that have at least one parent. If I(G) hold then the function

g(F(x_1), \ldots, F(x_N)) = \prod_i R_{c_i}(F(x_i), \{F(pa_{ik})\})

is a valid copula density c(F(x_1), \ldots, F(x_N)) over X.

The above theorem can be proved directly via induction or using our reparameterization lemma and standard BN results. It is important to note that the local copulas do not need to agree on the non-univariate marginals of overlapping variables. This is a result of the fact that each copula c_i only appears as part of a quotient term which is used to parameterize a conditional density. This gives us the freedom to mix and match local copulas of different types. Equally important is the fact that aside from the univariate densities, we do not need to concern ourselves with any marginal constraints when estimating the parameters of these local copula functions.

3.3 A Multivariate Copula Model
We are now ready to construct a joint density given univariate marginals by properly composing local terms and without worrying about global coherence:

Definition 3.4: A Copula Bayesian Network (CBN) is a triplet C = (G, \Theta_C, \Theta_f) that encodes the joint density f_X(x). \Theta_C is a set of local copula density functions c_i(F(x_i), \{F(pa_{ik})\}) that are associated with the nodes of G that have at least one parent. \Theta_f is the set of parameters representing the marginal densities f(x_i). f_X(x) is parameterized as

f_X(x) = \prod_i R_{c_i}(F(x_i), \{F(pa_{ik})\}) f(x_i).

Using our previous developments and applying Eq. (1) to f_X(x), we have:

Corollary 3.5: A Copula Bayesian Network defines a valid joint density f_X(x) whose marginal distributions are parameterized by \Theta_f and where the independence statements I(G) hold.
The main difference between the CBN model and a regular BN, aside from a novel choice for the local conditional parameterization, is in the shared global component that has the explicit semantics of the univariate marginals.
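As a concrete sanity check of this construction (our own illustrative sketch, not the paper's code): for a chain X1 → X2 → X3 with standard normal marginals and bivariate Gaussian local copulas, the CBN density of Definition 3.4 should coincide with the familiar Gaussian chain factorization, because with a single parent C(1, u) = u, so the denominator of R_c is one and R_c reduces to the local copula density:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(u):
    """Inverse standard normal CDF by bisection (sufficient for a sketch)."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gaussian_copula_density(u, v, rho):
    """Bivariate Gaussian copula density c(u, v; rho)."""
    a, b = norm_ppf(u), norm_ppf(v)
    q = 1.0 - rho * rho
    return math.exp(-(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * q)) / math.sqrt(q)

def cbn_density(x1, x2, x3, r12, r23):
    """CBN density (Definition 3.4) for the chain X1 -> X2 -> X3 with standard
    normal marginals; each quotient R_c is just the local copula density here."""
    f = norm_pdf(x1) * norm_pdf(x2) * norm_pdf(x3)
    f *= gaussian_copula_density(norm_cdf(x2), norm_cdf(x1), r12)
    f *= gaussian_copula_density(norm_cdf(x3), norm_cdf(x2), r23)
    return f

def chain_density(x1, x2, x3, r12, r23):
    """Reference: direct factorization f(x1) f(x2 | x1) f(x3 | x2) of the same chain."""
    s12, s23 = math.sqrt(1.0 - r12 * r12), math.sqrt(1.0 - r23 * r23)
    return (norm_pdf(x1)
            * norm_pdf((x2 - r12 * x1) / s12) / s12
            * norm_pdf((x3 - r23 * x2) / s23) / s23)
```

The two functions agree pointwise, illustrating Corollary 3.5 in the one case where the answer is known in closed form; replacing a local Gaussian copula with, say, a Frank copula changes only the corresponding factor.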
Concretely, the CBN model allows us to decompose the problem of representing a multivariate distribution with given (or estimated) univariate marginals into many local problems that, depending on the structure of G, can be substantially smaller in dimension.
For each family of X_i and its parents we are still faced with the problem of choosing an appropriate local copula. In this work we simply limit ourselves to copulas that have a convenient multivariate form, but any of the recently suggested methods for constructing multivariate copula functions (see Section 6) can also be used. In either case, limiting ourselves to a smaller number of variables (a node and its parents) makes the construction of the local copula substantially easier than the construction of the full copula over X. Importantly, as in the case of BNs, our construction of a joint copula density that decomposes over the graph structure G also facilitates efficient parameter estimation and model selection (structure learning), as we briefly discuss in the next section.

4 Learning
As in the case of BNs, the product form of our CBN facilitates relatively efficient estimation and model selection. The machinery is standard and only briefly described below.

Parameter Estimation
Given a complete dataset D of M instances where all of the variables X are observed in each instance, the log-likelihood of the data given a CBN model C is

\ell(D : C) = \sum_{m=1}^M \sum_i \log f(x_i[m]) + \sum_{m=1}^M \sum_i \log R_{c_i}(F(x_i[m]), F(pa_{i1}[m]), \ldots, F(pa_{ik_i}[m])).

While this objective appears to fully decompose according to the structure of G, each marginal distribution F(x_i) actually appears in several local copula terms (of X_i and its children in G). To facilitate efficient estimation, we adopt the common approach where the marginals are estimated first [13].
Given F(x_i), we can then estimate the parameters of each local copula independently of the others. We estimate the univariate densities using a standard normal kernel-based approach [24].
In this work we consider two of the simplest and most commonly used copula functions. For Frank's Archimedean copula

C(u_1, \ldots, u_N) = -\frac{1}{\theta} \log\left(1 + \prod_i (e^{-\theta u_i} - 1) / (e^{-\theta} - 1)^{N-1}\right),

and for the Gaussian copula (see Section 2) with a uniform correlation parameter, we find the maximum likelihood parameters using a standard conjugate gradient algorithm. For the Gaussian copula with a full covariance matrix, a reasonably effective and substantially more efficient method is based on the relationship between the copula function and Kendall's Tau dependence measure [19]. For lack of space, further details for both of these copulas are provided in the supplementary material.

Figure 2: Train and test set performance for the 12 variable Wine, 28 variable Dow Jones and 100 variable Crime datasets. Panels: Wine Train, Dow Jones Train, Crime Train; Wine Test, Dow Jones Test, Crime Test. Models compared: Sigmoid BN; CBN with a uniform correlation normal copula (single parameter); CBN with a full normal copula (0.5 * d(d - 1) parameters); CBN with Frank's single parameter copula. Shown is the 10-fold average log-probability per instance (y-axis) vs. the maximal number of parents allowed in the network (x-axis). Error bars (slightly shifted for readability) show the 10-90% range. The structure for all models was learned with the same search procedure using the BIC model selection score.

Model Selection
Very briefly, to learn the structure of G, we use a standard score-based approach that starts with the empty network, and greedily advances via local modifications to the current structure (add/delete/reverse edge).
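The Kendall's Tau route for the Gaussian copula mentioned above can be sketched as follows (an illustrative sketch using the classical relation tau = (2/pi) arcsin(rho) for the Gaussian copula; the data generation, sample size, and tolerance are our own choices, not the paper's):

```python
import math
import random

def kendall_tau(xs, ys):
    """Sample Kendall's tau: (concordant - discordant) / number of pairs.
    O(n^2), which is fine for a sketch."""
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2.0)

def rho_from_tau(tau):
    """Invert tau = (2 / pi) * arcsin(rho), the classical relation between
    Kendall's tau and the Gaussian-copula correlation parameter."""
    return math.sin(math.pi * tau / 2.0)

# Recover a known dependence (rho = 0.6) from ranks alone; since tau depends
# only on ranks, the same estimate is obtained for any monotone marginals.
rng = random.Random(1)
rho_true = 0.6
xs, ys = [], []
for _ in range(1000):
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    xs.append(g1)
    ys.append(rho_true * g1 + math.sqrt(1.0 - rho_true * rho_true) * g2)
rho_hat = rho_from_tau(kendall_tau(xs, ys))
```

Because the estimate uses only ranks, it is unaffected by the kernel-based marginal transformations, which is what makes it a convenient shortcut compared to full maximum likelihood.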
The search is guided by the Bayesian information criterion [28] that balances the likelihood of the model and its complexity,

score(G : D) = \ell(D : \hat{\theta}, G) - \frac{1}{2} \log(M) |\Theta_G|,

where \hat{\theta} are the maximum-likelihood parameters, and |\Theta_G| is the number of free parameters associated with the graph structure G. During the search, we also use a TABU list and random restarts [10] to mitigate the problem of local maxima. See Koller and Friedman [16] for more details.

5 Experimental Evaluation
We assess the effectiveness of our approach for density estimation by comparing CBNs and BNs learned from training data in terms of log-probability performance on test data. For BNs, we use a linear Gaussian conditional density and a non-linear Sigmoid one (see Koller and Friedman [16]). For CBNs, to demonstrate the flexibility of our framework, we consider the three local copula functions discussed in Section 4: fully parametrized Normal copula; the same copula with a single correlation parameter and unit diagonal (UnifCorr); Frank's single parameter Archimedean copula. We use standard normal kernel density estimation for the univariate densities. The structure of both the BN and CBN models was learned using the same greedy structure search procedure described in Section 4. We consider three datasets of a markedly different nature and dimensionality:
• Wine Quality (UCI repository). 11 physicochemical properties and a sensory quality variable for the red Portuguese "Vinho Verde" wine [4].
Included are measurements from 1599 tastings.
• Dow Jones. 2001-2005 (1508 trading days) daily adjusted changes of the 30 index stocks. To avoid arbitrary imputation, two stocks not traded in all of these days were excluded (KFT, TRV).
• Crime (UCI repository). 100 observed variables relating to crime ranging from household size to fraction of children born outside of a marriage, for 1994 communities across the U.S.

Figure 3: Comparison of the number of edges learned in the different random runs for different models (y-axis) vs. the Sigmoid BN model (x-axis), when the maximal number of parents in the network was limited to 4. Panels: Wine dataset; Crime dataset.

Figure 2 compares average log-probability (y-axis) for 10 random equal train/test splits as a function of the maximal number of parents allowed in the network (x-axis).
Results for the linear Gaussian\nBN were almost identical to those of the sigmoid BN for the Wine and Dow Jones datasets and\ninferior for the Crime dataset, and are omitted for clarity. For all datasets, the copula based models\noffer a clear gain in training performance as well as in generalization on unseen test instances.\nRemarkably, the single parameter (for each local density) UnifCorr model is superior to the BN\nmodel even when the latter utilizes up to 8 local parameters (with 4 parents). In fact, even Frank\u2019s\nsingle parameter Archimedean copula which is constrained by the fact that all of its K-marginals are\nequal [23], is superior to the BN model. Importantly, the advantage of the CBN model is signi\ufb01cant\nas the units of improvement are in bits/instance. That is, an improvement of 2 bits/instance translates\ninto each test instance being, on average, four times as likely.3 It is also important to note the bene\ufb01t\nthat comes with structures that are richer than a tree. As the number of allowed parents (x-axis) is\nincreased, gains are relatively small when the dimensionality of the domain is limited (12 variables);\nThe gains are, however, quite substantial for the more complex domains.\nTo understand the role of the univariate marginals, we start with the no dependency network (0\non x-axis), where the advantage of CBNs is solely due to the use of \ufb02exible univariate marginals.\nSurprisingly, even with single parameter copulas, although much simpler than the Sigmoid form\nused for the BN model, we are able to maintain much of that advantage as the model becomes\nmore complex. As expected, this is not the case when we constrain the CBN model to have normal\nmarginals (Normal-UnifCorr) and when the domain is suf\ufb01ciently complex (Crime).\nTo get a sense of the overall dependency structure, Figure 3 shows the number of edges learned for\nthe different models. 
For the Wine dataset, the linear BN attempts to compensate for its constrained form by using substantially more edges than the non-linear Sigmoid BN. The Kernel-UnifCorr CBN, in contrast, tends to use fewer edges while achieving higher test performance. Finally, the Normal-UnifCorr CBN model, despite the forced normal marginals, does not lead to overly complex structures as it is constrained by the simplicity of the copula function (single parameter). For the challenging Crime dataset, the differences are more pronounced: both the linear and non-linear BN models almost saturate the limit of 4 parents per variable, while the Kernel-UnifCorr copula model requires, on average, less than half the number of parents to achieve superior performance.
Finally, in Figure 4, we demonstrate the qualitative advantage of CBNs by comparing empirical values from the test data (left) with samples generated from the different models. For the 'physical density' and 'alcohol' variables (top), the CBN samples (middle) are better than the BN ones (right), but not dramatically so. However, for the 'residual sugar' and 'physical density' pair (bottom), where the empirical dependence is far from normal, the advantage of the CBN representation is clear. We recall that the CBN model uses a simple normal copula so that the advantage is solely rooted in the distortion of the input to the copula created by the kernel-based univariate representation. With more expressive copulas we can expect further qualitative and quantitative advantages.

3 Note that the performance for the crime domain is on an unusually high scale since some of the variables are closely correlated, leading to peaked densities.
We emphasize that this does not affect the relative merit of a method - an advantage of a bit/instance still translates to each instance being, on average, twice as likely.

(Figure 3 axes: # edges in competitor vs. # edges in Sigmoid BN; legend: Kernel-UnifCorr CBN, Gaussian BN, Normal-UnifCorr CBN.)

Figure 4: Demonstration of the dependency learned for the Wine dataset for two variable pairs. Panels: Empirical; CBN Samples; BN Samples. Compared is the empirical distribution in the test data (left) with samples generated from the learned CBN (middle) and BN (right) models. To eliminate the effect of differences in structure, the CBN model was forced to use the structure learned for the BN model which contains the network fragment 'residual sugar' → 'physical density' → 'alcohol level'.

6 Related Work
For lack of space we do not discuss direct multivariate copula constructions (e.g., [8, 15, 18, 22]) that are typically effective only for few dimensions, and focus on composite constructions that build on smaller (bivariate) copulas. The Vine model [3] relies on a recursive construction of bivariate copulas to parameterize a multivariate one. Although it uses a graphical representation, the framework is inherently different from ours: conditional independence is replaced with a conditional dependence whose parameters depend on the conditioning variable(s). Kurowicka and Cooke [17] reveal a direct connection between vines and belief networks, but that is limited to the scenario of elliptical bivariate copulas. Relying on the same representation, Aas et al. [1] suggest an alternative construction methodology.
While the vine representation is certainly general, the need to condition on many variables using a somewhat elaborate construction limits practical applications to a modest number of variables. Aas et al. [1] do note the simplification that can result from making independence assumptions, but do not provide a general framework for doing so. Savu and Trede [26] suggest an alternative model that is limited to a hierarchical tree structure of bivariate Archimedean copulas. Kirshner [14] uses the copula product operator of Darsow et al. [5] to suggest a mixture of trees model that is directly motivated by the field of graphical models. The relationship of our model to theirs is the same as that of a general BN to a mixture of trees model [21]. Most recently, Liu et al. [20] consider a general sparse undirected copula-based model that is focused on the semi- and non-parametric aspect of modeling, and is specific to the case of the normal copula.
Finally, it is important to put the dimension of the domains we consider in this work (up to 100 variables) in perspective. Copula applications are numerous yet most are limited to a relatively small number (< 10) of variables. Heinen and Alfonso [11] are unique in that they consider 95 variables, but using an approach that is tailored to the specific details of the GARCH model.

7 Discussion and Future Work
We presented Copula Bayesian Networks, a marriage between the Bayesian network and copula frameworks. Building on a novel reparameterization of the conditional density, our model offers great flexibility in modeling high-dimensional continuous distributions while offering control over the form of the univariate marginals. We applied our approach to three markedly different real-life datasets and, in all cases, demonstrated a consistent and significant generalization advantage.
Our contribution is threefold.
First, our framework allows us to flexibly "mix and match" local copulas and univariate densities of any form. Second, like BNs, we allow for independence assumptions that are more expressive than those possible with tree-based constructions, leading to generalization advantages. Third, we leverage existing machinery to perform model selection in significantly higher dimensions than typically considered in the copula literature. Thus, our work opens the door for numerous applications where the flexibility of copulas is needed but could not previously be utilized. In a companion paper [6], we also show that CBNs give rise to an efficient inference procedure.
The gap between train and test performance for CBNs motivates the development of model selection scores tailored to the copula framework (e.g., based on rank correlation). It would also be interesting to see if our framework can be adapted to the cumulative scenario, while allowing for independencies quite different from the recently introduced cumulative network model [12].

Acknowledgements
I am grateful to Ariel Jaimovich, Amir Globerson, Nir Friedman and Fabio Spizzichino for their comments on earlier drafts of this manuscript. G. Elidan was supported by the Alon fellowship.

References
[1] K. Aas, C. Czado, A. Frigessi, and H. Bakken. Pair-copula constructions of multiple dependencies. Insurance: Mathematics and Economics, 44:182-198, 2009.
[2] R. Accioly and F. Chiyoshi. Modeling dependence with copulas: a useful tool for field development decision process.
Journal of Petroleum Science and Engineering, 44:83-91, 2004.
[3] T. Bedford and R. Cooke. Vines - a new graphical model for dependent random variables. Annals of Statistics, 30(4):1031-1068, 2002.
[4] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547-553, 2009.
[5] W. Darsow, B. Nguyen, and E. Olsen. Copulas and Markov processes. Illinois Journal of Mathematics, 36:600-642, 1992.
[6] G. Elidan. Inference-less density estimation using Copula Bayesian Networks. In Uncertainty in Artificial Intelligence (UAI), 2010.
[7] P. Embrechts, F. Lindskog, and A. McNeil. Modeling dependence with copulas and applications to risk management. Handbook of Heavy Tailed Distributions in Finance, 2003.
[8] M. Fischer and C. Kock. Constructing and generalizing given multivariate copulas. Technical report, Working paper, University of Erlangen-Nurnberg, Nurnberg, 2007.
[9] N. Friedman and I. Nachman. Gaussian Process Networks. In Uncertainty in Artificial Intelligence (UAI), 2000.
[10] F. Glover and M. Laguna. Tabu search. In C. Reeves, editor, Modern Heuristic Techniques for Combinatorial Problems, Oxford, England, 1993. Blackwell Scientific Publishing.
[11] A. Heinen and A. Alfonso. Asymmetric CAPM dependence for large dimensions: The canonical vine autoregressive copula model. ECORE Discussion Paper, 2008.
[12] J. Huang and B. Frey. Cumulative distribution networks and the derivative-sum-product algorithm. In Uncertainty in Artificial Intelligence (UAI), 2008.
[13] H. Joe and J. Xu. The estimation method of inference functions for margins for multivariate models. Technical Report 166, Department of Statistics, University of British Columbia, 1996.
[14] S. Kirshner. Learning with tree-averaged densities and distributions. In Neural Information Processing Systems (NIPS), 2007.
[15] K. Koehler and J.
Symanowski. Constructing multivariate distributions with specific marginal distributions. Journal of Multivariate Analysis, 55:261-282, 1995.
[16] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[17] D. Kurwicka and R. Cooke. The vine copula method for representing high dimensional dependent distributions: Applications to continuous belief nets. In The Winter Simulation Conference, 2002.
[18] E. Liebscher. Modelling and estimation of multivariate copulas. Technical report, Working paper, University of Applied Sciences, Merseburg, 2006.
[19] F. Lindskog, A. McNeil, and U. Schmock. Kendall's tau for elliptical distributions. Credit Risk - Measurement, Evaluation and Management, pages 149-156, 2003.
[20] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295-2328, 2009.
[21] M. Meila and M. Jordan. Estimating dependency structure as a hidden variable. In Neural Information Processing Systems (NIPS), 1998.
[22] P. Morillas. A method to obtain new copulas from a given one. Metrika, 61:169-184, 2005.
[23] R. Nelsen. An Introduction to Copulas. Springer, 2007.
[24] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.
[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[26] C. Savu and M. Trede. Hierarchical Archimedean copulas. In the Conference on High Frequency Finance, 2006.
[27] A. Schwaighofer, M. Dejori, V. Tresp, and M. Stetter. Structure Learning with Nonparametric Decomposable Models. In the International Conference on Artificial Neural Networks, 2007.
[28] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
[29] A. Sklar.
Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229-231, 1959.