{"title": "Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests", "book": "Advances in Neural Information Processing Systems", "page_first": 2355, "page_last": 2363, "abstract": "We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable, meaning that every latent variable has four children that do not have any other parents in common. We show that the existence of such a quartet allows us to uniquely identify each latent variable and to learn all parameters involving that latent variable. Underlying our algorithm are two new techniques for structure learning: a quartet test to determine whether a set of binary variables are singly coupled, and a conditional mutual information test that we use to learn parameters. We also show how to subtract already learned latent variables from the model to create new singly-coupled quartets, which substantially expands the class of structures that we can learn. Finally, we give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM.", "full_text": "Discovering Hidden Variables in Noisy-Or Networks\n\nusing Quartet Tests\n\nYacine Jernite, Yoni Halpern, David Sontag\n\nCourant Institute of Mathematical Sciences\n\nNew York University\n\n{halpern, jernite, dsontag}@cs.nyu.edu\n\nAbstract\n\nWe give a polynomial-time algorithm for provably learning the structure and pa-\nrameters of bipartite noisy-or Bayesian networks of binary variables where the\ntop layer is completely hidden. 
Unsupervised learning of these models is a form\nof discrete factor analysis, enabling the discovery of hidden variables and their\ncausal relationships with observed data. We obtain an ef\ufb01cient learning algorithm\nfor a family of Bayesian networks that we call quartet-learnable. For each latent\nvariable, the existence of a singly-coupled quartet allows us to uniquely identify\nand learn all parameters involving that latent variable. We give a proof of the poly-\nnomial sample complexity of our learning algorithm, and experimentally compare\nit to variational EM.\n\n1\n\nIntroduction\n\nWe study the problem of discovering the presence of latent variables in data and learning models\ninvolving them. The particular family of probabilistic models that we consider are bipartite noisy-or\nBayesian networks where the top layer is completely hidden. Unsupervised learning of these models\nis a form of discrete factor analysis and has applications in sociology, psychology, epidemiology,\neconomics, and other areas of scienti\ufb01c inquiry that need to identify the causal relationships of\nhidden or latent variables with observed data (Saund, 1995; Martin & VanLehn, 1995). Furthermore,\nthese models are widely used in expert systems, such as the QMR-DT network for medical diagnosis\n(Shwe et al. , 1991). The ability to learn the structure and parameters of these models from partially\nlabeled data could dramatically increase their adoption.\nWe obtain an ef\ufb01cient learning algorithm for a family of Bayesian networks that we call quartet-\nlearnable, meaning that every latent variable has a singly-coupled quartet (i.e.\nfour children of\na latent variable for which there is no other latent variable that is shared by at least two of the\nchildren). We show that the existence of such a quartet allows us to uniquely identify each latent\nvariable and to learn all parameters involving that latent variable. 
Furthermore, using a technique introduced by Halpern & Sontag (2013), we show how to subtract already learned latent variables to create new singly-coupled quartets, substantially expanding the class of structures that we can learn. Importantly, even if we cannot discover every latent variable, our algorithm guarantees the correctness of any latent variable that was discovered. We show in Sec. 4 that our algorithm can learn nearly all of the structure of the QMR-DT network for medical diagnosis (i.e., discovering the existence of hundreds of diseases) simply from data recording the symptoms of each patient.
Underlying our algorithm are two new techniques for structure learning. First, we introduce a quartet test to determine whether a set of binary variables is singly-coupled. When singly-coupled variables are found, we use previous results in mixture model learning to identify the coupling latent variable. Second, we develop a conditional point-wise mutual information test to learn parameters of other children of identified latent variables. We give a self-contained proof of the polynomial sample complexity of our structure and parameter learning algorithms, by bounding the error propagation due to finding roots of polynomials.

Figure 1: Left: Example of a quartet-learnable network. For this network, the order (X, Y, Z) satisfies the definition: {a, b, c, d} is singly coupled by X, {c, e, f, g} is singly coupled by Y given X, and {d, g, h, i} is singly coupled by Z given X, Y. Right: Example of two different networks that have the same observable moments (i.e., distribution on a, b, c). pX = 0.2, pY = 0.3, pZ = 0.37. fX = (0.1, 0.2, 0.3), fY = (0.6, 0.4, 0.5), fZ = (0.28, 0.23, 0.33). The noise probabilities and full moments are given in the supplementary material.
Finally, we present an experimental comparison of our structure learning algorithm to the variational expectation maximization algorithm of Šingliar & Hauskrecht (2006) on a synthetic image-decomposition problem and show competitive results.
Related work. Martin & VanLehn (1995) study structure learning for noisy-or Bayesian networks, observing that any two observed variables that share a hidden parent must be correlated. Their algorithm greedily attempts to find a small set of cliques that cover the dependencies of which it is most certain. Kearns & Mansour (1998) give a polynomial-time algorithm with provable guarantees for structure learning of noisy-or bipartite networks with bounded in-degree. Their algorithm incrementally constructs the network, in each step adding a new observed variable, introducing edges from the existing latent variables to the observed variable, and then seeing if new latent variables should be created. This approach requires strong assumptions, such as identical priors for the hidden variables and all incoming edges for an observed variable having the same failure probabilities.
Silva et al. (2006) study structure learning in linear models with continuous latent variables, giving an algorithm for discovering disjoint subsets of observed variables that have a single hidden variable as their parent. Recent work has used tensor methods and sparse recovery to learn linear latent variable models with graph expansion (Anandkumar et al., 2013), and also continuous admixture models such as latent Dirichlet allocation (Anandkumar et al., 2012a). The discrete variable setting is not linear, making it non-trivial to apply these methods that rely on linearity of expectation. An alternative approach is to perform gradient ascent on the likelihood or use expectation maximization (EM). Although more robust to model error, the likelihood is nonconvex and these methods do not have consistency guarantees. Elidan et al.
(2001) seek "structural signatures", in their case semi-cliques, to use as structure candidates within structural EM (Elidan & Friedman, 2006; Friedman, 1997; Lazic et al., 2013). Our algorithm could be used in the same way.
Exact inference is intractable in noisy-or networks (Cooper, 1987), so Šingliar & Hauskrecht (2006) give a variational EM algorithm for unsupervised learning of the parameters of a bipartite noisy-or network. We will use this as a baseline in our experimental results.
Spectral approaches to learning mixture models originated with Chang's spectral method (Chang 1996; analyzed in Mossel & Roch 2005, see also Anandkumar et al. (2012b)). The binary variable setting is a special case and is discussed in Lazarsfeld (1950) and Pearl & Tarsi (1986). In Halpern & Sontag (2013) the parameters of singly-coupled variables in bipartite networks of known structure are learned using mixture model learning.
Quartet tests have been previously used for learning latent tree models (Anandkumar et al., 2011; Pearl & Tarsi, 1986). Our quartet test, like that of Ishteva et al. (2013) and Eriksson (2005), uses the full fourth-order moment and a similar unfolding of the fourth-order moment matrix.
Background. We consider bipartite noisy-or Bayesian networks $(G, \Theta)$ with n binary latent variables U, which we denote with capital letters (e.g. X), and m observed binary variables O, which we denote with lower case letters (e.g. a). The edges in the model are directed from the latent variables to the observed variables, as shown in Fig. 1. In the noisy-or framework, an observed variable is on if at least one of its parents is on and does not fail to activate it.
The entire Bayesian network is parametrized by $n \times m + n + m$ parameters.
These parameters consist of prior probabilities on the latent variables, $p_X$ for $X \in U$, failure probabilities between latent and observed variables, $\vec{f}_X$ (a vector of size m), and noise or leak probabilities $\vec{\nu} = \{\nu_1, \ldots, \nu_m\}$. An equivalent formulation includes the noise in the model by introducing a single 'noise' latent variable, $X_0$, which is present with probability $p_0 = 1$ and has failure probabilities $\vec{f}_0 = 1 - \vec{\nu}$. The Bayesian network only has an edge between latent variable X and observed variable a if $f_{X,a} < 1$. The generative process for the model is then:

• The states of the latent variables are drawn independently: $X \sim \text{Bernoulli}(p_X)$ for $X \in U$.
• Each $X \in U$ with $X = 1$ activates observed variable a with probability $1 - f_{X,a}$.
• An observed variable $a \in O$ is "on" (a = 1) if it is activated by at least one of its parents.

The algorithms described in this paper make substantial use of sets of moments of the observed variables, particularly the negative moments. Let $S \subseteq O$ be a set of observed variables, and $\mathcal{X} \subseteq U$ be the set of parents of S. The joint distribution of a bipartite noisy-or network can be shown to have the following factorization, where $S = \{o_1, \ldots, o_{|S|}\}$:

$$N_{G,S} = P(o_1 = 0, o_2 = 0, \ldots, o_{|S|} = 0) = \prod_{U \in \mathcal{X}} \Big(1 - p_U + p_U \prod_{i=1}^{|S|} f_{U,o_i}\Big). \quad (1)$$

The full joint distribution can be obtained from the negative moments via inclusion-exclusion formulas. We denote $N_G$ to be the set of negative moments of the observed variables under $(G, \Theta)$. In the remainder of this section we will review two results described in Halpern & Sontag (2013).
Parameter learning of singly-coupled triplets. We say that a set O of observed variables is singly-coupled by a parent X if X is a parent of every member of O and there is no other parent Y that is shared by at least two members of O.
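The generative process and the negative-moment factorization of Eq. (1) can be exercised directly; a minimal sketch assuming numpy (the function names and the toy parameters are ours, not the paper's):

```python
import numpy as np

def sample_noisy_or(priors, fail, leak, n_samples, rng):
    """Draw samples of the observed layer of a bipartite noisy-or network.
    priors: (n,) latent priors p_U; fail: (n, m) failure probabilities f_{U,a};
    leak: (m,) leak probabilities (chance each child turns on by noise alone)."""
    n, m = fail.shape
    U = rng.random((n_samples, n)) < priors            # latent states
    # P(child off | parents) = (1 - leak) * product of failures of active parents
    off = np.tile(1.0 - leak, (n_samples, 1))
    for j in range(n):
        off *= np.where(U[:, [j]], fail[j], 1.0)
    return rng.random((n_samples, m)) < 1.0 - off      # observed states

def negative_moment(priors, fail, leak, subset):
    """Exact P(all children in subset are 0) via the factorization of Eq. (1),
    with the leak treated as an always-on parent X0."""
    prob = np.prod(1.0 - leak[subset])
    for p, f in zip(priors, fail):
        prob *= 1.0 - p + p * np.prod(f[subset])
    return prob
```

Comparing the empirical frequency of an all-off event against `negative_moment` on a small two-parent network is a quick sanity check of both functions.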
A singly coupled set of observations is a binary mixture model, which gives rise to the next result based on a rank-2 tensor decomposition of the joint distribution. If (a, b, c) are singly-coupled by X, we can learn $p_X$ and $f_{X,a}$ as follows. Let $M_1 = P(b, c, a = 0)$, $M_2 = P(b, c, a = 1)$, and $M_3 = M_2 M_1^{-1}$. Solving for $(\lambda_1, \lambda_2) = \text{eigenvalues}(M_3)$, we then have:

$$p_X = \frac{1 + \lambda_2}{\lambda_2 - \lambda_1} \, \mathbf{1}^T (M_2 - \lambda_1 M_1) \mathbf{1} \quad \text{and} \quad f_{X,a} = \frac{1 + \lambda_1}{1 + \lambda_2}. \quad (2)$$

Subtracting off. Because of the factored form of Equation 1, we can remove the influence of a latent variable from the negative moments. Let X be a latent variable of G. Let $S \subseteq O$ be a set of observations and $\mathcal{X}$ be the parents of S. If we know $N_{G,S}$, the prior of X, and the failure probabilities $f_{X,S}$, we can obtain the negative moments of S under $(G \setminus \{X\}, \Theta)$. When S includes all of the children of X, this operation "subtracts off" or removes X from the network:

$$N_{G \setminus X, S} = \prod_{U \in \mathcal{X} \setminus X} \Big(1 - p_U + p_U \prod_{i=1}^{|S|} f_{U,o_i}\Big) = \frac{N_{G,S}}{1 - p_X + p_X \prod_{i=1}^{|S|} f_{X,o_i}}. \quad (3)$$

2 Structure learning

Our paper focuses on learning the structure of these bipartite networks, including the number of latent variables. We begin with the observation that not all structures are identifiable, even if given infinite data. Suppose we applied the tensor decomposition method to the marginal distribution (moments) of three observed variables that share two parents. Often we can learn a network with the same marginal distribution, but where these three variables have just one parent. Figure 1 gives an example of such a network.
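For intuition, Eq. (2) can be exercised on the exact moments of a small singly-coupled triplet; a minimal sketch assuming numpy (the toy model, its parameters, and the helper name are ours):

```python
import numpy as np

def triplet_params(M1, M2):
    """Recover (pX, fX_a) from M1 = P(b, c, a=0) and M2 = P(b, c, a=1)
    for a triplet (a, b, c) singly coupled by X, via Eq. (2)."""
    lam1, lam2 = sorted(np.linalg.eigvals(M2 @ np.linalg.inv(M1)).real)
    pX = (1 + lam2) / (lam2 - lam1) * np.sum(M2 - lam1 * M1)
    fXa = (1 + lam1) / (1 + lam2)
    return pX, fXa

# Exact moments for one latent X with prior 0.25, failures (0.2, 0.3, 0.4)
# for children (a, b, c), and per-child leaks (0.02, 0.05, 0.03).
pX, f, leak = 0.25, np.array([0.2, 0.3, 0.4]), np.array([0.02, 0.05, 0.03])
M1, M2 = np.zeros((2, 2)), np.zeros((2, 2))
for x, w in [(0, 1 - pX), (1, pX)]:
    off = (1 - leak) * f**x                  # P(child = 0 | X = x)
    pb = np.array([off[1], 1 - off[1]])      # distribution of b given X = x
    pc = np.array([off[2], 1 - off[2]])      # distribution of c given X = x
    M1 += w * off[0] * np.outer(pb, pc)
    M2 += w * (1 - off[0]) * np.outer(pb, pc)
```

On these exact moments the recovered values match the prior 0.25 and failure probability 0.2 up to floating point.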
As a result, if we hope to be able to learn structure, we need to make additional assumptions (e.g., every latent variable has at least four children).
We give two variants of an algorithm based on quartet tests, and prove its correctness in Section 3. Our approach is based on decomposing the structure learning problem into two tasks: (1) identifying the latent variables, and (2) determining to which observed variables they are connected.

2.1 Finding singly coupled quartets

Since triplets are not sufficient to identify a latent variable (Figure 1), we propose a new approach based on identifying singly-coupled quartets. We present two methods to find such quartets. The first is based on a rank test on a matrix formed from the fourth order moments and the second uses variance of parameters learned from third order moments. We then present a method that uses the point-wise mutual information of a triplet to identify all the other children of the new latent variable. The outline of the learning algorithm is presented in Algorithm 1.

Algorithm 1 STRUCTURE-LEARN
Input: Observations S, thresholds τq, τ′q, τe.
Output: Latent structure Latent
1: Latent = {}
2: while not converged do
3:   for all quartets (a, b, c, d) in S do
4:     T ← JOINT(a, b, c, d)
5:     T ← ADJUST(T, Latent)
6:     if PRETEST(T, τe) and 4TEST(T, τq, τ′q) then
7:       // (a, b, c, d) are singly-coupled.
8:       L ← MIXTURE(a, b, c, d)
9:       children ← EXTEND(L, Latent, τe)
10:      Latent ← Latent ∪ {(L, children)}
11:    end if
12:  end for
13: end while

Algorithm 2 EXTEND
Input: Latent variable L with singly-coupled children (a, b, c, d), currently known latent structure Latent, threshold τ
Output: children, all the children of L.
1: children = {(a, fL,a), (b, fL,b), (c, fL,c), (d, fL,d)}
2: for all observable x ∉ {a, b, c, d} do
3:   Subtract off coupling parents in Latent from the moments
4:   if P(ā, b̄)/(P(ā)P(b̄)) > P(ā, b̄|x̄)/(P(ā|x̄)P(b̄|x̄)) + τ then
5:     fL,x = FAILURE(a, b, x, L)
6:     children ← children ∪ {(x, fL,x)}
7:   end if
8: end for
9: Return children

Figure 2: Structure learning. Left: Main routine of the algorithm. JOINT gives the joint distribution and ADJUST subtracts off the influence of the latent variables (Eq. 3). PRETEST filters the set of candidate quartets by determining whether every triplet in a quartet has a shared parent, using Lemma 2. 4TEST refers to either of the quartet tests described in Section 2.1. τ′q is only used in the coherence quartet test. MIXTURE refers to using Eq. 2 to learn the parameters for all triplets in a singly-coupled quartet. This yields multiple estimates for each parameter and we take the median. Right: Algorithm to identify all of the children of a latent variable. FAILURE uses the method outlined in Section 2.2 (see Eq. 6) to find the failure probability fL,x.

While not all networks can be learned, this method allows us to define a class of noisy-or networks on which we can perform structure learning.
Definition 1. A noisy-or network is quartet-learnable if there exists an ordering of its latent variables such that each one has a quartet of children which are singly coupled once the previous latent variables are removed from the model. A noisy-or network is strongly quartet-learnable if all of its latent variables have a singly coupled quartet of children.
An example of a quartet-learnable network is given in Figure 1.
Rank test. A candidate quartet for the rank test is a quartet where all nodes have at least one common parent. One way to find whether a candidate quartet is singly coupled is by looking directly at the rank of its fourth-order moments matrix.
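Numerically, this rank check amounts to unfolding the 2×2×2×2 joint of the quartet into 4×4 matrices and inspecting third singular values; a small self-contained sketch assuming numpy (helper names and toy parameters are ours):

```python
import numpy as np
from itertools import product

def joint_tensor(priors, fail, leak):
    """Exact 2x2x2x2 joint distribution of four noisy-or children.
    priors: (n,) parent priors; fail: (n, 4) failure probs; leak: (4,)."""
    T = np.zeros((2, 2, 2, 2))
    for states in product([0, 1], repeat=len(priors)):
        w = np.prod([p if s else 1 - p for p, s in zip(priors, states)])
        active = np.array(states, dtype=bool)
        off = (1 - leak) * np.prod(fail[active], axis=0)  # P(child=0 | states)
        for o in product([0, 1], repeat=4):
            T[o] += w * np.prod([1 - q if v else q for q, v in zip(off, o)])
    return T

def third_singular_values(T):
    """Third singular value of the three unfoldings (ab|cd), (ac|bd), (ad|bc);
    all three are near zero when the quartet behaves as singly coupled."""
    unfoldings = [T.reshape(4, 4),
                  T.transpose(0, 2, 1, 3).reshape(4, 4),
                  T.transpose(0, 3, 1, 2).reshape(4, 4)]
    return [np.linalg.svd(M, compute_uv=False)[2] for M in unfoldings]
```

With one parent, every unfolding is a sum of two rank-1 terms, so the third singular values vanish; adding a second parent over two of the children makes at least one unfolding exceed rank 2.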
We have three ways to unfold the 2 × 2 × 2 × 2 tensor defined by these moments into a 4 × 4 matrix: we can consider the joint probability matrix of the aggregated variables (a, b) and (c, d), of (a, c) and (b, d), or of (a, d) and (b, c). We discuss the rank property for the first unfolding, but note that it holds for all three.
Let M be the 4 × 4 matrix obtained this way, and $\mathcal{S}$ be the set of parents that are parents of both (a, b) and (c, d). For all $S \subseteq \mathcal{S}$ let $q_S$ and $r_S$ be the vectors of the probabilities of (a, b) and (c, d) respectively given that S is the set of parents that are active. Then:

$$M = \sum_{S \subseteq \mathcal{S}} \Big( \prod_{X \in S} p_X \prod_{Y \in \mathcal{S} \setminus S} (1 - p_Y) \Big) q_S r_S^T.$$

In particular, this means that if there is only one parent shared between (a, b) and (c, d), M is the sum of two rank 1 matrices, and thus is at most rank 2.
Conversely, if $|\mathcal{S}| > 1$, M is the sum of at least 4 rank 1 matrices, and its elements are polynomial expressions of the parameters of the model. The determinant itself is then a polynomial function of the parameters of the model, i.e. $P(p_X, f_{X,u} \;\; \forall X \in \mathcal{S}, u \in \{a, b, c, d\})$. We give examples in the supplementary material of parameter settings showing that $P \not\equiv 0$, hence the set of its roots has measure 0, which means that the third largest eigenvalue (using the eigenvalues' absolute values) of M is non-zero with probability one.
This will allow us to determine whether a candidate quartet is singly coupled by looking at the third eigenvalues of the three unfoldings of its joint distribution tensor. However, for the algorithm to be practical, we need a slightly stronger formalization of the property:
Definition 2.
We say that a model is ε-rank-testable if for any quartet {a, b, c, d} that share a parent U and any non-empty set of latent variables $\mathcal{V}$ such that $U \notin \mathcal{V}$ and $\exists V \in \mathcal{V}, (f_{V,b} \neq 1 \wedge f_{V,c} \neq 1)$, the third eigenvalue of the moments matrix M corresponding to the sub-network $\{U, a, b, c, d\} \cup \mathcal{V}$ is at least ε.
Any (finite) noisy-or network whose parameters were drawn at random is ε-rank-testable for some ε with probability 1. The special case where all failure probabilities are equal also falls within this framework, provided they are not too close to 0 or 1. We can then determine whether a quartet is singly coupled by testing whether the third eigenvalues of all of the three unfoldings of the joint distributions are below a threshold, τq. If this test succeeds, we learn its parameters using Eq. 2.
Coherence test. Let {a, b, c, d} be a quartet of observed variables. To determine whether it is singly coupled, we can also apply Eq. 2 to learn the parameters of triplets (a, b, c), (a, b, d), (a, c, d) and (b, c, d) as if they were singly coupled. This gives us four overlapping sets of parameters. If the variance of parameter estimates exceeds a threshold we know that the quartet is not singly coupled.
Note that agreement between the parameters learned is necessary but not sufficient to determine that (a, b, c, d) are singly coupled. For example, in the case of a fully connected graph with two parents, four children and identical failure probabilities, the third-order moments of any triplet are identical, hence the parameters learned will be the same. Lemma 1, however, states that the moments generated from the estimated parameters can only be equal to the true moments if the quartet is actually singly coupled.
Lemma 1.
If the model is ε-rank-testable and (a, b, c, d) are not singly coupled, then if $M_R$ represents the reconstructed moments and M the true moments, we have:

$$\|M_R - M\|_1 > \Big(\frac{\varepsilon}{8}\Big)^4.$$

This can be proved using a result on eigenvalue perturbation from Elsner (1985) for an unfolding of the moments' tensor. These two properties lead to the following algorithm: First try to learn the parameters as if the quartet were singly coupled. If the variance of the parameter estimates exceeds a threshold, then reject the quartet. Next, check whether we can reconstruct the moments using the mean of the parameter estimates. Accept the quartet as singly-coupled if the reconstruction error is below a second threshold.

2.2 Extending Latent Variables

Once we have found a singly coupled quartet (a, b, c, d), the second step is to find all other children of the coupling parent A. To that end, we can use a property of the conditional point-wise mutual information (CPMI) that we introduce in this section. In this section, we use the notation ā to denote the event a = 0. The CPMI between a and b given x is defined as $\text{CPMI}(a, b|x) \equiv P(\bar{a}, \bar{b}|\bar{x})/(P(\bar{a}|\bar{x})P(\bar{b}|\bar{x}))$. We will compare it to the point-wise mutual information (PMI) between a and b defined as $\text{PMI}(a, b) \equiv P(\bar{a}, \bar{b})/(P(\bar{a})P(\bar{b}))$.
Let (a, b) be two observed variables that we know only share one parent A, and let x be any other observed variable. We show how the CPMI between a and b given x can be used to find $f_{A,x}$, the failure probability of x given A. Our algorithm requires that the priors of all of the hidden variables be less than 1/2.
For any observed variable x, the following lemma states that CPMI(a, b|x) ≠ PMI(a, b) if and only if a, b and x share a parent.
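This equality test can be checked on exact moments; a minimal sketch assuming numpy (the three-variable toy models and helper names are ours):

```python
import numpy as np
from itertools import product

def joint3(priors, fail, leak):
    """Exact joint P(a, b, x) for three noisy-or children; fail is (n, 3)."""
    T = np.zeros((2, 2, 2))
    for states in product([0, 1], repeat=len(priors)):
        w = np.prod([p if s else 1 - p for p, s in zip(priors, states)])
        active = np.array(states, dtype=bool)
        off = (1 - leak) * np.prod(fail[active], axis=0)  # P(child=0 | states)
        for o in product([0, 1], repeat=3):
            T[o] += w * np.prod([1 - q if v else q for q, v in zip(off, o)])
    return T

def pmi_cpmi_gap(T):
    """PMI(a, b) minus CPMI(a, b | x), both on the all-off events; the gap
    is zero exactly when a, b and x have no common parent."""
    pmi = T[0, 0, :].sum() / (T[0].sum() * T[:, 0].sum())
    px = T[:, :, 0].sum()                                 # P(x = 0)
    cpmi = T[0, 0, 0] * px / (T[0, :, 0].sum() * T[:, 0, 0].sum())
    return pmi - cpmi
```

When x is driven by a separate parent the gap vanishes exactly; when x shares parent A with a and b, the PMI strictly exceeds the CPMI.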
Since the only latent variable that has both a and b as children is A, this is equivalent to saying that x is a child of A.
Lemma 2. Let (a, b, x) be three observed variables in a noisy-or network, and let $U_{a,b}$ be the set of common parents of a and b. For $U \in U_{a,b}$, defining

$$p_{U|\bar{x}} = \frac{P(U, \bar{x})}{P(\bar{x})} = \frac{p_U f_{U,x}}{1 - p_U + p_U f_{U,x}}, \quad (4)$$

we have $p_{U|\bar{x}} \le p_U$. Furthermore,

$$\frac{P(\bar{a}, \bar{b}|\bar{x})}{P(\bar{a}|\bar{x})P(\bar{b}|\bar{x})} = \prod_{U \in U_{a,b}} \frac{1 - p_{U|\bar{x}} + p_{U|\bar{x}} f_{U,a} f_{U,b}}{(1 - p_{U|\bar{x}} + p_{U|\bar{x}} f_{U,a})(1 - p_{U|\bar{x}} + p_{U|\bar{x}} f_{U,b})} \le \frac{P(\bar{a}, \bar{b})}{P(\bar{a})P(\bar{b})},$$

with equality if and only if (a, b, x) do not share a parent.
The proof for Lemma 2 is given in the supplementary material. As a result, if a and b have only parent A in common, we can write:

$$R \equiv \text{CPMI}(a, b|x) = \frac{P(\bar{a}, \bar{b}|\bar{x})}{P(\bar{a}|\bar{x})P(\bar{b}|\bar{x})} = \frac{1 - p_{A|\bar{x}} + p_{A|\bar{x}} f_{A,a} f_{A,b}}{(1 - p_{A|\bar{x}} + p_{A|\bar{x}} f_{A,a})(1 - p_{A|\bar{x}} + p_{A|\bar{x}} f_{A,b})}.$$

We can equivalently write this equation as $Q(p_{A|\bar{x}}) = 0$ for the quadratic function Q(x) given by:

$$Q(x) = R(f_{A,a} - 1)(f_{A,b} - 1)x^2 + [R(f_{A,a} + f_{A,b} - 2) - (f_{A,a} f_{A,b} - 1)]x + R - 1. \quad (5)$$

Moreover, we can show that $Q'(x) = 0$ for some x > 1/2, hence one of the roots of Q is always greater than 1/2. In our framework, we know that $p_{A|\bar{x}} \le p_A \le \frac{1}{2}$, hence $p_{A|\bar{x}}$ is simply the smaller root of Q. After solving for $p_{A|\bar{x}}$, we can obtain $f_{A,x}$ using Eq. 4:

$$f_{A,x} = \frac{p_{A|\bar{x}}(1 - p_A)}{p_A(1 - p_{A|\bar{x}})}. \quad (6)$$

Extending step. Once we find a singly-coupled quartet (a, b, c, d) with common parent A, Lemma 2 allows us to determine whether a new variable x is also a child of A. Notice that for this step we only need to use two of the children in {a, b, c, d}, which we arbitrarily choose to be a and b. If x is found to be a child of A, we can solve for $f_{A,x}$ using Eqs. 5 and 6. Algorithm 2 combines these two steps to find the parameters of all the children of A after a singly-coupled quartet has been found.
Parameter learning with known structure. When the structure of the network is known, singly-coupled triplets are sufficient for identifiability without resorting to the quartet tests in Section 2.1. That setting was previously studied in Halpern & Sontag (2013), which required every edge to be part of a singly coupled triplet or pair for its parameters to be learnable (possibly after subtracting off latent variables). Our new CPMI technique improves this result by enabling us to learn all failure probabilities for a latent variable's children even if the variable has only one singly coupled triplet.

3 Sample complexity analysis

In Section 2, we gave two variants of an algorithm to learn the structure of a class of noisy-or networks. We now want to upper bound the number of samples it requires to learn the structure of the network correctly with high probability, as a function of the ranges in which the parameters are found. All priors are in $[p_{min}, 1/2]$, all failure probabilities are in $[f_{min}, f_{max}]$, and the marginal probability of an observed variable x being off is lower bounded by $n_{min} \le P(\bar{x})$. The full proofs for these results are given in the supplementary materials.
Theorem 1.
If a network with m observed variables is strongly quartet-learnable and ζ-rank-testable, then its structure can be learned in polynomial time with probability $(1 - \delta)$ and with a polynomial number of samples equal to:

$$O\Big( \max\Big( \frac{1}{\zeta^8}, \frac{1}{n_{min}^8 p_{min}^2 (1 - f_{max})^8} \Big) \ln\Big(\frac{2m}{\delta}\Big) \Big).$$

After N samples, the additive error on any of the parameters ε(N) is bounded with probability $1 - \delta$ by:

$$\varepsilon(N) \le O\Big( \frac{1}{f_{min}^{18} (1 - f_{max})^6 n_{min}^{28} p_{min}^{13}} \sqrt{\frac{\ln(2m/\delta)}{N}} \Big).$$

We obtain this result by determining the accuracy we need for our tests to be provably correct, and bounding how much the error in the output of the parameter learning algorithms depends on the input. This proves that we can learn a class of strongly quartet-learnable noisy-or networks in polynomial time and sample complexity. Next, we show how to extend the analysis to quartet-learnable networks as defined in Section 2 by subtracting off latent variables that we have previously learned. If some of the removed latent variables were coupling for an otherwise singly coupled quartet, we then discover new latent variables, and repeat the operation. If a network is quartet-learnable, we can find all of the latent variables in a finite number of subtracting off steps, which we call the depth of the network (thus, a strongly quartet-learnable network has depth 0). To prove that the structure learning algorithm remains correct, we simply need to show that the estimated subtracted off moments remain close to the true ones.
Lemma 3. If the additive error on the estimated negative moments of an observed quartet C and on the parameters for W latent variables $X_1, \ldots$
, $X_W$ whose influence we want to remove from C is at most ε, then the error on the subtracted off moments for C is $O(W 4^W \varepsilon)$.
We define the width of the network to be the maximum number of parents that need to be subtracted off to be able to learn the parameters for a new singly-coupled quartet (this is typically a small constant). This leads to the following result:
Theorem 2. If a network with m observed variables is quartet-learnable at depth d, is ζ-rank-testable, and has width W, then its structure can be learned with probability $(1 - \delta)$ with $N_S$ samples, where:

$$N_S = O\Big( \Big( \frac{W 4^W}{f_{min}^{18} (1 - f_{max})^6 n_{min}^{28} p_{min}^{13}} \Big)^{2d} \times \max\Big( \frac{1}{\zeta^8}, \frac{1}{n_{min}^8 p_{min}^2 (1 - f_{max})^8} \Big) \ln\Big(\frac{2m}{\delta}\Big) \Big).$$

The first factor of this expression has to do with the error introduced in the estimate of the parameters each time we do a subtracting off step, which by definition occurs at most d times, hence the exponent. We notice that the bounds do not depend directly on the number of latent variables, indicating that we can learn networks with many latent variables, as long as the number of subtraction steps is small. While this bound is useful for proving that the sample complexity is indeed polynomial, in the experiments section we show that in practice our algorithm obtains reasonable results on sample sizes well below the theoretical bound.

4 Experiments

Depth of aQMR-DT. Halpern & Sontag (2013) previously showed that the parameters of the anonymized QMR-DT network for medical diagnosis (provided by the University of Pittsburgh through the efforts of Frances Connell, Randolph A. Miller, and Gregory F. Cooper) could be learned from data recording only symptoms if the structure is known. We now show that the structure can also be learned. Here we assume that the quartet tests are perfect (i.e. infinite data setting).
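The subtracting-off step (Eq. 3) that this depth analysis counts can be sketched on exact moments; a minimal example assuming numpy (names and toy parameters are ours):

```python
import numpy as np

def negative_moment(priors, fail, subset):
    """P(all children in subset are 0) under the factorization of Eq. (1);
    the leak can be folded in as an always-on parent with prior 1."""
    out = 1.0
    for p, f in zip(priors, fail):
        out *= 1.0 - p + p * np.prod(f[subset])
    return out

def subtract_off(N, pX, fX, subset):
    """Remove latent variable X from a negative moment via Eq. (3)."""
    return N / (1.0 - pX + pX * np.prod(fX[subset]))
```

Dividing out one latent variable's factor leaves exactly the negative moment of the network with that variable deleted, which is what lets newly exposed singly-coupled quartets be tested at the next depth.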
Table 1 compares the depth of the aQMR-DT network using triplets and quartets. Structure learning discovers all but four of the diseases, two of which would not be learnable even if the structure were known. These two diseases are discussed in Halpern & Sontag (2013) and share all of their children except for one symptom each, resulting in a situation where no singly-coupled triplets can be found. The additional two diseases that cannot be learned share all but two children with each other. Thus, for these two latent variables, singly-coupled triplets exist but singly-coupled quartets do not.
Implementation. We test the performance of our algorithm on the synthetic image dataset used in Šingliar & Hauskrecht (2006). The Bayesian network consists of 8 latent variables and 64 observed variables, arranged in an 8×8 grid of pixels. Each of the latent variables connects to a subset of the observed pixels (see Figure 3). The latent variable priors are set to 0.25, the failure probabilities for all edges are set to 0.1, and leak probabilities are set to 0.001. We generate samples from the network and use them to test the ability of our algorithm to discover the latent variables and network structure from the samples. The network is quartet-learnable, but the first and last of the ground truth sources shown in Figure 3 can only be learned after a subtraction step.
We use variational EM (Šingliar & Hauskrecht, 2006) as a baseline, using 16 random initializations and choosing the run with the highest lower bound on likelihood. We found that multiple initializations substantially improved the quality of its result.
The variational algorithm is given the correct number of sources as input. For our algorithm, we use the rank-based quartet test, which has the advantage of requiring only one threshold, τq, compared to the two needed by the coherence test. In our algorithm, the thresholds determine the number of discovered latent variables (sources).

Table 1: Right: The depth at which latent variables (i.e., diseases) are discovered and parameters learned in the aQMR-DT network for medical diagnosis (Shwe et al., 1991) using the quartet-based structure learning algorithm, assuming infinite data. Left: Comparison to parameter learning with known structure, using one singly-coupled triplet to learn the failure probabilities for all of a disease's symptoms. The parameters learned at level 0 can be learned without any subtracting-off step. Those marked depth inf cannot be learned.

Triplets (known structure):
depth | priors learned | edges learned
0     | 527 | 43,139
1     | 39  | 2,109
2     | 2   | 100
3     | 0   | 0
inf   | 2   | 122

Quartets (unknown structure):
depth | diseases discovered | edges learned
0     | 469 | 39,522
1     | 82  | 4,875
2     | 13  | 789
3     | 2   | 86
inf   | 4   | 198

Quartets are pre-filtered using pointwise mutual information to reject quartets that have non-siblings (i.e. (a, b, c, d) where a and b are likely not siblings). All quartets that fail the pretest or the rank test are discarded. We sort the remaining quartets by third singular value and proceed from lowest to highest. For each quartet in sorted order we check if it overlaps with a latent variable previously learned in this round. If it does not, we create a new latent variable and use the EXTEND step to find all of its children. The algorithm converges when no quartets pass the threshold.
Figure 3 shows how the algorithms perform on the synthetic dataset with varying numbers of samples.
Unless otherwise specified, our experiments use threshold values τq = 0.01 and τe = 0.1. Experiments exploring the sensitivity of the algorithm to these thresholds can be found in the supplementary material. The running time of the quartet algorithm is under 6 minutes for 10,000 samples using a parallel implementation with 16 cores. For comparison, the variational algorithm on the same samples takes 4 hours using 16 cores simultaneously (one random initialization per core) on the same machine. The variational run-time scales linearly with sample size, while the quartet algorithm is independent of sample size once the quartet marginals are computed.

[Figure 3: rows correspond to sample sizes 100, 500, 1000, 2000, 10000, and 10000*; columns show the sources recovered by Variational EM and by Quartet Structure Learning (at depths d=0 and d=1), alongside the ground truth sources.]

Figure 3: A comparison between the variational algorithm of Šingliar & Hauskrecht (2006) and the quartet algorithm as the number of samples increases. The true network structure is shown on the right, with one image for each of the eight latent variables (sources). For each edge from a latent variable to an observed variable, the corresponding pixel intensity specifies 1 − fX,a (black means no edge). The results of the quartet algorithm are divided by depth. Column d=0 shows the sources learned without any subtraction and d=1 shows the sources learned after a single subtraction step. Nothing was learned at d > 1. The sample size of 10,000* refers to 10,000 samples using an optimized value for the threshold of the rank-based quartet test (τq = 0.003).

5 Conclusion

We presented a novel algorithm for learning the structure and parameters of bipartite noisy-or Bayesian networks where the top layer consists completely of latent variables. Our algorithm can learn a broad class of models that may be useful for factor analysis and unsupervised learning.
The structure learning algorithm does not depend on an ability to estimate the parameters in strongly quartet-learnable networks. As a result, it may be possible to generalize the approach beyond the noisy-or setting to other bipartite Bayesian networks, including those with continuous variables and discrete variables of more than two states.

References

Anandkumar, Anima, Chaudhuri, Kamalika, Hsu, Daniel, Kakade, Sham, Song, Le, & Zhang, Tong. 2011. Spectral Methods for Learning Multivariate Latent Tree Structure. Proceedings of NIPS 24, 2025–2033.

Anandkumar, Anima, Foster, Dean, Hsu, Daniel, Kakade, Sham, & Liu, Yi-Kai. 2012a. A spectral algorithm for latent Dirichlet allocation. Proceedings of NIPS 25, 926–934.

Anandkumar, Animashree, Hsu, Daniel, & Kakade, Sham M. 2012b. A method of moments for mixture models and hidden Markov models. In: Proceedings of COLT 2012.

Anandkumar, Animashree, Javanmard, Adel, Hsu, Daniel J, & Kakade, Sham M. 2013. Learning Linear Bayesian Networks with Latent Variables. Pages 249–257 of: Proceedings of ICML.

Chang, Joseph T. 1996. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1), 51–73.

Cooper, Gregory F. 1987. Probabilistic Inference Using Belief Networks Is NP-Hard. Technical Report BMIR-1987-0195. Medical Computer Science Group, Stanford University.

Elidan, Gal, & Friedman, Nir. 2006. Learning hidden variable networks: The information bottleneck approach. Journal of Machine Learning Research, 6(1), 81.

Elidan, Gal, Lotner, Noam, Friedman, Nir, & Koller, Daphne. 2001. Discovering hidden variables: A structure-based approach. Advances in Neural Information Processing Systems, 479–485.

Elsner, Ludwig. 1985. An optimal bound for the spectral variation of two matrices. Linear Algebra and its Applications, 71, 77–80.

Eriksson, Nicholas. 2005.
Tree construction using singular value decomposition. Algebraic Statistics for Computational Biology, 347–358.

Friedman, Nir. 1997. Learning Belief Networks in the Presence of Missing Values and Hidden Variables. Pages 125–133 of: ICML '97.

Halpern, Yoni, & Sontag, David. 2013. Unsupervised Learning of Noisy-Or Bayesian Networks. In: Conference on Uncertainty in Artificial Intelligence (UAI-13).

Ishteva, Mariya, Park, Haesun, & Song, Le. 2013. Unfolding Latent Tree Structures using 4th Order Tensors. In: ICML '13.

Kearns, Michael, & Mansour, Yishay. 1998. Exact inference of hidden structure from sample data in noisy-OR networks. Pages 304–310 of: Proceedings of UAI 14.

Lazarsfeld, Paul. 1950. Latent Structure Analysis. In: Stouffer, Samuel, Guttman, Louis, Suchman, Edward, Lazarsfeld, Paul, Star, Shirley, & Clausen, John (eds), Measurement and Prediction. Princeton, New Jersey: Princeton University Press.

Lazic, Nevena, Bishop, Christopher M, & Winn, John. 2013. Structural Expectation Propagation: Bayesian structure learning for networks with latent variables. In: Proceedings of AISTATS 16.

Martin, J, & VanLehn, Kurt. 1995. Discrete factor analysis: Learning hidden variables in Bayesian networks. Tech. rept. Department of Computer Science, University of Pittsburgh.

Mossel, Elchanan, & Roch, Sébastien. 2005. Learning nonsingular phylogenies and hidden Markov models. Pages 366–375 of: Proceedings of 37th STOC. ACM.

Pearl, Judea, & Tarsi, Michael. 1986. Structuring causal trees. Journal of Complexity, 2(1), 60–77.

Saund, Eric. 1995. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 51–71.

Shwe, Michael A, Middleton, B, Heckerman, DE, Henrion, M, Horvitz, EJ, Lehmann, HP, & Cooper, GF. 1991. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Meth. Inform. Med., 30, 241–255.

Silva, Ricardo, Scheine, Richard, Glymour, Clark, & Spirtes, Peter. 2006. Learning the structure of linear latent variable models. The Journal of Machine Learning Research, 7, 191–246.

Šingliar, Tomáš, & Hauskrecht, Miloš. 2006. Noisy-or component analysis and its application to link analysis. The Journal of Machine Learning Research, 7, 2189–2213.