{"title": "Robust Learning of Fixed-Structure Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10283, "page_last": 10295, "abstract": "We investigate the problem of learning Bayesian networks in a robust model where an $\\epsilon$-fraction of the samples are adversarially corrupted. In this work, we study the fully observable discrete case where the structure of the network is given. Even in this basic setting, previous learning algorithms either run in exponential time or lose dimension-dependent factors in their error guarantees. We provide the first computationally efficient robust learning algorithm for this problem with dimension-independent error guarantees. Our algorithm has near-optimal sample complexity, runs in polynomial time, and achieves error that scales nearly-linearly with the fraction of adversarially corrupted samples. Finally, we show on both synthetic and semi-synthetic data that our algorithm performs well in practice.", "full_text": "Robust Learning of Fixed-Structure Bayesian Networks\n\nYu Cheng\nDepartment of Computer Science\nDuke University\nDurham, NC 27708\nyucheng@cs.duke.edu\n\nIlias Diakonikolas\nDepartment of Computer Science\nUniversity of Southern California\nLos Angeles, CA 90089\nilias.diakonikolas@gmail.com\n\nDaniel M. Kane\nDepartment of Computer Science and Engineering\nUniversity of California, San Diego\nLa Jolla, CA 92093\ndakane@ucsd.edu\n\nAlistair Stewart\nDepartment of Computer Science\nUniversity of Southern California\nLos Angeles, CA 90089\nstewart.al@gmail.com\n\nAbstract\n\nWe investigate the problem of learning Bayesian networks in a robust model where an $\\epsilon$-fraction of the samples are adversarially corrupted. 
In this work, we study the fully observable discrete case where the structure of the network is given. Even in this basic setting, previous learning algorithms either run in exponential time or lose dimension-dependent factors in their error guarantees. We provide the first computationally efficient robust learning algorithm for this problem with dimension-independent error guarantees. Our algorithm has near-optimal sample complexity, runs in polynomial time, and achieves error that scales nearly-linearly with the fraction of adversarially corrupted samples. Finally, we show on both synthetic and semi-synthetic data that our algorithm performs well in practice.\n\n1 Introduction\n\nProbabilistic graphical models [KF09] provide an appealing and unifying formalism to succinctly represent structured high-dimensional distributions. The general problem of inference in graphical models is of fundamental importance and arises in many applications across several scientific disciplines; see [WJ08] and references therein. In this work, we study the problem of learning graphical models from data [Nea03, DSA11]. There are several variants of this general learning problem depending on: (i) the precise family of graphical models considered (e.g., directed, undirected), (ii) whether the data is fully or partially observable, and (iii) whether the structure of the underlying graph is known a priori or not (parameter estimation versus structure learning). This learning problem has been studied extensively along these axes during the past five decades (see, e.g., [CL68, Das97, AKN06, WRL06, AHHK12, SW12, LW12, BMS13, BGS14, Bre15]), resulting in a beautiful theory and a collection of algorithms in various settings.\n\nThe main vulnerability of all these algorithmic techniques is that they crucially rely on the assumption that the samples are precisely generated by a graphical model in the given family. 
This simplifying assumption is inherent for known guarantees in the following sense: if there exists even a very small fraction of arbitrary outliers in the dataset, the performance of known algorithms can be totally compromised. It is important to explore the natural setting when the aforementioned assumption holds only in an approximate sense. Specifically, we study the following broad question:\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nQuestion 1 (Robust Learning of Graphical Models). Can we efficiently learn graphical models when a constant fraction of the samples are corrupted, or equivalently, when the model is slightly misspecified?\n\nIn this paper, we focus on the model of corruptions considered in [DKK+16] (Definition 1), which generalizes many other existing models, including Huber\u2019s contamination model [Hub64]. Intuitively, given a set of good samples (from the true model), an adversary is allowed to inspect the samples before corrupting them, both by adding corrupted points and deleting good samples. In contrast, in Huber\u2019s model, the adversary is oblivious to the samples and is only allowed to add bad points.\n\nWe would like to design robust learning algorithms for Question 1 whose sample complexity, N, is close to the information-theoretic minimum, and whose computational complexity is polynomial in N. We emphasize that the crucial requirement is that the error guarantee of the algorithm is independent of the dimensionality d of the problem.\n\n1.1 Formal Setting and Our Results\n\nIn this work, we study Question 1 in the context of Bayesian networks [JN07]. We focus on the fully observable case when the underlying network is given. 
In the non-robust setting, this learning problem is straightforward: the \u201cempirical estimator\u201d (which coincides with the maximum likelihood estimator) is known to be sample and computationally efficient [Das97]. In sharp contrast, even this most basic regime is surprisingly challenging in the robust setting. For example, the very special case of robustly learning a Bernoulli product distribution (corresponding to an empty network with no edges) was analyzed only recently in [DKK+16].\n\nTo formally state our results, we first give a detailed description of the corruption model we study.\n\nDefinition 1 ($\\epsilon$-Corrupted Samples). Given $0 < \\epsilon < 1/2$ and a distribution family $\\mathcal{P}$, the algorithm specifies some number of samples N, and N samples $X_1, X_2, \\ldots, X_N$ are drawn from some (unknown) ground-truth $P \\in \\mathcal{P}$. The adversary is allowed to inspect P and the samples, and replaces $\\epsilon N$ of them with arbitrary points. The set of N points is then given to the algorithm. We say that a set of samples is $\\epsilon$-corrupted if it is generated by this process.\n\nBayesian Networks. Fix a directed acyclic graph, G, whose vertices are labelled $[d] \\stackrel{\\mathrm{def}}{=} \\{1, 2, \\ldots, d\\}$ in topological order (every edge points from a vertex with a smaller index to one with a larger index). We will denote by Parents(i) the set of parents of node i in G. A probability distribution P on $\\{0, 1\\}^d$ is defined to be a Bayesian network (or Bayes net) with graph G if for each $i \\in [d]$, we have that $\\Pr_{X \\sim P}[X_i = 1 \\mid X_1, \\ldots, X_{i-1}]$ depends only on the values $X_j$ where $j \\in \\mathrm{Parents}(i)$. Such a distribution P can be specified by its conditional probability table.\n\nDefinition 2 (Conditional Probability Table of Bayesian Networks). Let P be a Bayesian network with graph G. Let $\\Gamma$ be the set $\\{(i, a) : i \\in [d], a \\in \\{0, 1\\}^{|\\mathrm{Parents}(i)|}\\}$. Let $m = |\\Gamma|$. 
For $(i, a) \\in \\Gamma$, the parental configuration $\\Pi_{i,a}$ is defined to be the event that $X_{\\mathrm{Parents}(i)} = a$. The conditional probability table $p \\in [0, 1]^m$ of P is given by $p_{i,a} = \\Pr_{X \\sim P}[X_i = 1 \\mid \\Pi_{i,a}]$.\n\nNote that P is determined by G and p. We will frequently index p as a vector. We use the notation $p_k$ and the associated events $\\Pi_k$, where each $k \\in [m]$ stands for an $(i, a) \\in \\Gamma$ lexicographically ordered.\n\nOur Results. We give the first efficient robust learning algorithm for Bayesian networks with a known graph G. Our algorithm has information-theoretically near-optimal sample complexity, runs in time polynomial in the size of the input (the samples), and provides an error guarantee that scales near-linearly with the fraction of adversarially corrupted samples, under the following restrictions: First, we assume that each parental configuration is reasonably likely. Intuitively, this assumption seems necessary because we need to observe each configuration many times in order to learn the associated conditional probability to good accuracy. Second, we assume that each of the conditional probabilities is balanced, i.e., bounded away from 0 and 1. This assumption is needed for technical reasons. In particular, we need this to show that a good approximation to the conditional probability table implies that the corresponding Bayesian network is close in total variation distance.\n\nFormally, we say that a Bayesian network is c-balanced, for some $c > 0$, if all coordinates of the corresponding conditional probability table are between c and $1 - c$. Throughout the paper, we use $m = \\sum_{i=1}^{d} 2^{|\\mathrm{Parents}(i)|}$ for the size of the conditional probability table of P, and $\\alpha$ for the minimum probability of a parental configuration of P: $\\alpha = \\min_{(i,a) \\in \\Gamma} \\Pr_P[\\Pi_{i,a}]$. We now state our main result.\n\nTheorem 3 (Main). Fix $0 < \\epsilon < 1/2$. 
Let P be a c-balanced Bayesian network on $\\{0, 1\\}^d$ with known structure G. Assume $\\alpha \\ge \\Omega(\\epsilon \\sqrt{\\log(1/\\epsilon)}/c)$. Let S be an $\\epsilon$-corrupted set of $N = \\widetilde{\\Omega}(m \\log(1/\\tau)/\\epsilon^2)$ samples from P.\u00b9 Given G, $\\epsilon$, $\\tau$, and S, we can compute a Bayesian network Q such that, with probability at least $1 - \\tau$, $d_{\\mathrm{TV}}(P, Q) \\le \\epsilon \\sqrt{\\ln(1/\\epsilon)}/(\\alpha c)$. Our algorithm runs in time $\\widetilde{O}(N d^2/\\epsilon)$.\n\nOur algorithm is given in Section 3. We first note that the sample complexity of our algorithm is near-optimal for learning Bayesian networks with known structure. The following sample complexity lower bound holds even without corrupted samples:\n\nFact 4 (Sample Complexity Lower Bound, [CDKS17]). Let $\\mathcal{BN}_{d,f}$ denote the family of Bernoulli Bayesian networks on d variables such that every node has at most f parents. The worst-case sample complexity of learning $\\mathcal{BN}_{d,f}$, within total variation distance $\\epsilon$ and with probability 9/10, is $\\Omega(2^f \\cdot d/\\epsilon^2)$ for all $f \\le d/2$ when the graph structure is known.\n\nFor Bayes nets whose average in-degree is close to the maximum in-degree, that is, when $m = \\Theta(2^f d)$, the sample complexity lower bound in Fact 4 becomes $\\Omega(m/\\epsilon^2)$, so our sample complexity is optimal up to polylogarithmic factors.\n\nWe remark that Theorem 3 is most useful when c is a constant and the Bayesian network has bounded fan-in f. In this case, the condition on $\\alpha$ follows from the c-balanced assumption: when both c and f are constants, $\\alpha = \\min_{(i,a) \\in \\Gamma} \\Pr_P[\\Pi_{i,a}] \\ge c^f$ is also a constant, so the condition $c^f \\ge \\Omega(\\epsilon \\sqrt{\\log(1/\\epsilon)})$ automatically holds when $\\epsilon$ is smaller than some constant. On the other hand, the problem of learning Bayesian networks is less interesting when the fan-in is too large. 
For example, if some node has $f = \\omega(\\log(d))$ parents, then the size of the conditional probability table is at least $2^f$, which is super-polynomial in the dimension d.\n\nExperiments. We performed an experimental evaluation of our algorithm on both synthetic and real data. Our evaluation allowed us to verify the accuracy and the sample complexity rates of our theoretical results. In all cases, the experiments validate the usefulness of our algorithm, which significantly outperforms previous approaches, almost exactly matching the best rate without noise.\n\nRelated Work. Question 1 fits in the framework of robust statistics [HR09, HRRS86]. Classical estimators from this field can be classified into two categories: either (i) they are computationally efficient but incur an error that scales polynomially with the dimension d, or (ii) they are provably robust (in the aforementioned sense) but are hard to compute. In particular, essentially all known estimators in robust statistics (e.g., the Tukey median [Tuk75]) have been shown [JP78, Ber06, HM13] to be intractable in the high-dimensional setting. We note that the robustness requirement does not typically pose information-theoretic impediments for the learning problem. In most cases of interest (see, e.g., [CGR15, CGR16, DKK+16]), the sample complexity of robust learning is comparable to its (easier) non-robust variant. The challenge is to design computationally efficient algorithms.\n\nEfficient robust estimators are known for various low-dimensional structured distributions (see, e.g., [DDS14, CDSS13, CDSS14a, CDSS14b, ADLS16, ADLS17, DLS18]). However, the robust learning problem becomes surprisingly challenging in high dimensions. Recently, there has been algorithmic progress on this front: [DKK+16, LRV16] give polynomial-time algorithms with improved error guarantees for certain \u201csimple\u201d high-dimensional structured distributions. 
The results of [DKK+16] apply to simple distributions, including Bernoulli product distributions, Gaussians, and mixtures thereof (under some natural restrictions). Since the works of [DKK+16, LRV16], computationally efficient robust estimation in high dimensions has received considerable attention (see, e.g., [DKS17, DKK+17, BDLS17, DKK+18a, DKS18b, DKS18a, HL18, KSS18, PSBR18, DKK+18b, KKM18, DKS18c, LSLC18]).\n\n1.2 Overview of Algorithmic Techniques\n\nOur algorithmic approach builds on the framework of [DKK+16] with new technical and conceptual ideas. At a high level, our algorithm works as follows: We draw an $\\epsilon$-corrupted set of samples from a Bayesian network P with known structure, and then iteratively remove samples until we can return the empirical conditional probability table.\n\n\u00b9 Throughout the paper, we use $\\widetilde{O}(f)$ to denote $O(f \\, \\mathrm{polylog}(f))$.\n\nFirst, we associate a vector $F(X)$ to each sample X so that learning the mean of $F(X)$ to good accuracy is sufficient to recover the distribution. In the case of binary products, $F(X)$ is simply X, while in our case we need to take into account additional information about conditional means.\n\nFrom this point, our algorithm will try to do one of two things: Either we show that the sample mean of $F(X)$ is close to the conditional mean of the true distribution (in which case we can already learn the ground-truth Bayes net P), or we are able to produce a filter, i.e., we can remove some of our samples, and it is guaranteed that we throw away more bad samples than good ones. If we produce a filter, we then iterate on those samples that pass the filter.\n\nTo produce a filter, we compute a matrix M which is roughly the empirical covariance matrix of $F(X)$. 
We show that if the corruptions are sufficient to notably disrupt the sample mean of $F(X)$, there must be many erroneous samples that are all far from the mean in roughly the same direction, and we can detect this direction by looking at the top eigenvector of M. If we project all samples onto this direction, concentration bounds of $F(X)$ will imply that almost all samples far from the mean are erroneous, and thus filtering them out will provide a cleaner set of samples.\n\nOrganization. Section 2 contains some technical results specific to Bayesian networks that we need. Section 3 gives the details of our algorithm and an overview of its analysis. In Section 4, we present the experimental evaluations. In Section 5, we conclude and propose directions for future work. Due to space constraints, we defer the proofs of the technical lemmas to the full version of the paper.\n\n2 Technical Preliminaries\n\nThe structure of this section is as follows: First, we bound the total variation distance between two Bayes nets in terms of their conditional probability tables. Second, we define a function $F(x, q)$, which takes a sample x and returns an m-dimensional vector that contains information about the conditional means. Finally, we derive a concentration bound from Azuma\u2019s inequality.\n\nLemma 5. Suppose that: (i) $\\min_{k \\in [m]} \\Pr_P[\\Pi_k] \\ge \\epsilon$, and (ii) P or Q is c-balanced, and (iii) $\\frac{3}{c} \\sqrt{\\sum_k \\Pr_P[\\Pi_k](p_k - q_k)^2} \\le \\epsilon$. Then we have that $d_{\\mathrm{TV}}(P, Q) \\le \\epsilon$.\n\nLemma 5 says that to learn a balanced fixed-structure Bayesian network, it is sufficient to learn all the relevant conditional means. However, each sample $x \\sim P$ gives us information about $p_{i,a}$ only if $x \\in \\Pi_{i,a}$. To resolve this, we map each sample x to an m-dimensional vector $F(x, q)$, and \u201cfill in\u201d the entries that correspond to conditional means for which the condition failed to happen. 
We will set these coordinates to their empirical conditional means q:\n\nDefinition 6. Let $F : \\{0, 1\\}^d \\times \\mathbb{R}^m \\to \\mathbb{R}^m$ be defined as follows: If $x \\in \\Pi_{i,a}$, then $F(x, q)_{i,a} = x_i$, otherwise $F(x, q)_{i,a} = q_{i,a}$.\n\nWhen $q = p$ (the true conditional means), the expectation of the $(i, a)$-th coordinate of $F(X, p)$, for $X \\sim P$, is the same conditioned on either $\\Pi_{i,a}$ or $\\neg \\Pi_{i,a}$. Using the conditional independence properties of Bayesian networks, we will show that the covariance of $F(x, p)$ is diagonal.\n\nLemma 7. For $X \\sim P$, we have $\\mathbb{E}(F(X, p)) = p$. The covariance matrix of $F(X, p)$ satisfies $\\mathrm{Cov}[F(X, p)] = \\mathrm{diag}(\\Pr_P[\\Pi_k] p_k (1 - p_k))$.\n\nOur algorithm makes crucial use of Lemma 7 (in particular, that $\\mathrm{Cov}[F(X, p)]$ is diagonal) to detect whether or not the empirical conditional probability table of the noisy distribution is close to the conditional probabilities.\n\nFinally, we will need a suitable concentration inequality that works under conditional independence properties. We can use Azuma\u2019s inequality to show that the projection of $F(X, q)$ on any direction v is concentrated around the projection of the sample mean q.\n\nLemma 8. For $X \\sim P$, any unit vector $v \\in \\mathbb{R}^m$, and any $q \\in [0, 1]^m$, we have $\\Pr[|v \\cdot (F(X, q) - q)| \\ge T + \\|p - q\\|_2] \\le 2 \\exp(-T^2/2)$.\n\n3 Robust Learning Algorithm\n\nWe first look into the major ingredients required for our filtering algorithm, and compare our proof with that for product distributions in [DKK+16] on a more technical level.\n\nIn Section 2, we mapped each sample X to $F(X, q)$ which contains information about the conditional means q, and we showed that it is sufficient to learn the mean of $F(X, q)$ to learn the ground-truth Bayes net.\n\nLet M denote the empirical covariance matrix of $(F(X, q) - q)$. 
We decompose M into three parts: one coming from the ground-truth distribution, one coming from the subtractive error (because the adversary can remove $\\epsilon N$ good samples), and one coming from the additive error (because the adversary can add $\\epsilon N$ bad samples). We will make use of the following observations:\n\n(1) The noise-free distribution has a diagonal covariance matrix.\n(2) The term coming from the subtractive error has no large eigenvalues.\n\nThese two observations imply that any large eigenvalues of M are due to the additive error. Finally, we will reuse our concentration bounds to show that if the additive errors are frequently far from the mean in a known direction, then they can be reliably distinguished from good samples.\n\nFor the case of binary product distributions in [DKK+16], (1) is trivial because the coordinates are independent; but for Bayesian networks we need to expand the dimension of the samples and fill in the missing entries properly. Condition (2) is due to concentration bounds, and for product distributions it follows from standard Chernoff bounds, while for Bayes nets, we must instead rely on martingale arguments and Azuma\u2019s inequality.\n\n3.1 Main Technical Lemma and Proof of Theorem 3\n\nFirst, we need to show that a large enough set of samples with no noise satisfies the properties we expect from a representative set of samples. We need the mean, covariance, and tail bounds of $F(X, p)$ to behave as we would expect them to. We call a set of samples that satisfies these properties $\\epsilon$-good for P.\n\nOur algorithm takes as input an $\\epsilon$-corrupted multiset $S'$ of $N = \\widetilde{\\Omega}(m \\log(1/\\tau)/\\epsilon^2)$ samples. We write $S' = (S \\setminus L) \\cup E$, where S is the set of samples before corruption, L contains the good samples that have been removed or (in later iterations) incorrectly rejected by filters, and E represents the remaining corrupted samples. 
We assume that S is $\\epsilon$-good. In the beginning, we have $|E| + |L| \\le 2\\epsilon|S|$. As we add filters in each iteration, E gets smaller and L gets larger. However, we will prove that our filter rejects more samples from E than S, so $|E| + |L|$ must get smaller.\n\nWe will prove Theorem 3 by iteratively running the following efficient filtering procedure:\n\nProposition 9 (Filtering). Let $0 < \\epsilon < 1/2$. Let P be a c-balanced Bayesian network on $\\{0, 1\\}^d$ with known structure G. Assume each parental configuration of P occurs with probability at least $\\alpha \\ge \\Omega(\\epsilon \\sqrt{\\log(1/\\epsilon)}/c)$. Let $S' = S \\cup E \\setminus L$ be a set of samples such that S is $\\epsilon$-good for P and $|E| + |L| \\le 2\\epsilon|S'|$. There is an algorithm that, given G, $\\epsilon$, and $S'$, runs in time $\\widetilde{O}(d|S'|)$, and either\n(i) Outputs a Bayesian network Q with $d_{\\mathrm{TV}}(P, Q) \\le \\epsilon \\sqrt{\\ln(1/\\epsilon)}/(c\\alpha)$, or\n(ii) Returns an $S'' = S \\cup E' \\setminus L'$ such that $|S''| \\le (1 - \\frac{\\epsilon}{d \\ln d})|S'|$ and $|E'| + |L'| < |E| + |L|$.\n\nIf this algorithm produces a subset $S''$, then we iterate using $S''$ in place of $S'$. We will present the algorithm establishing Proposition 9 in the following section. We first use it to prove Theorem 3.\n\nProof of Theorem 3. First, a set S of $N = \\widetilde{\\Omega}(m \\log(1/\\tau)/\\epsilon^2)$ samples is drawn from P. We assume the set S is $\\epsilon$-good for P. Then an $\\epsilon$-fraction of these samples are adversarially corrupted, giving a set $S' = S \\cup E \\setminus L$ with $|E|, |L| \\le \\epsilon|S'|$. 
Thus $S'$ satisfies the conditions of Proposition 9, and the algorithm outputs a smaller set $S''$ of samples that also satisfies the conditions of the proposition, or else outputs a Bayesian network Q with small $d_{\\mathrm{TV}}(P, Q)$ that satisfies Theorem 3. Since $|S'|$ decreases if we produce a filter, eventually we must output a Bayesian network.\n\nNext we analyze the running time. Observe that we can filter out at most $2\\epsilon N$ samples, because we reject more bad samples than good ones. By Proposition 9, every time we produce a filter, we remove at least $\\widetilde{\\Omega}(\\epsilon/d)|S'| = \\widetilde{\\Omega}(\\epsilon N/d)$ samples. Therefore, there are at most $\\widetilde{O}(d)$ iterations, and each iteration takes time $\\widetilde{O}(d|S'|) = \\widetilde{O}(N d)$ by Proposition 9, so the overall running time is $\\widetilde{O}(N d^2)$.\n\n3.2 Algorithm Filter-Known-Topology\n\nIn this section, we present Algorithm 1 that establishes Proposition 9.\u00b2\n\nAlgorithm 1 Filter-Known-Topology\n1: Input: The dependency graph G of P, $\\epsilon > 0$, and a (possibly corrupted) set of samples $S'$ from P. 
$S'$ satisfies that there exists an $\\epsilon$-good S with $S' = S \\cup E \\setminus L$ and $|E| + |L| \\le 2\\epsilon|S'|$.\n2: Output: A Bayes net Q or a subset $S'' \\subset S'$ that satisfies Proposition 9.\n3: Compute the empirical conditional probabilities $q_{i,a} = \\Pr_{X \\in_u S'}[X_i = 1 \\mid \\Pi_{i,a}]$.\n4: Compute the empirical minimum parental configuration probability $\\alpha = \\min_{(i,a)} \\Pr_{S'}[\\Pi_{i,a}]$.\n5: Define $F(X, q)$: If $x \\in \\Pi_{i,a}$ then $F(x, q)_{i,a} = x_i$, otherwise $F(x, q)_{i,a} = q_{i,a}$ (Definition 6).\n6: Compute the empirical second-moment matrix of $F(X, q) - q$ and zero its diagonal, i.e., $M \\in \\mathbb{R}^{m \\times m}$ with $M_{k,k} = 0$, and $M_{k,\\ell} = \\mathbb{E}_{X \\in_u S'}[(F(X, q)_k - q_k)(F(X, q)_\\ell - q_\\ell)]$ for $k \\ne \\ell$.\n7: Compute the largest (in absolute value) eigenvalue $\\lambda^*$ of M, and the associated eigenvector $v^*$.\n8: if $|\\lambda^*| \\le O(\\epsilon \\log(1/\\epsilon)/\\alpha)$ then\n9:   Return Q = the Bayes net with graph G and conditional probabilities q.\n10: else\n11:   Let $\\delta := 3\\sqrt{\\epsilon |\\lambda^*|}/\\alpha$. Pick any $T > 0$ that satisfies $\\Pr_{X \\in_u S'}[|v^* \\cdot (F(X, q) - q)| > T + \\delta] > 7 \\exp(-T^2/2) + 3\\epsilon/(T^2 \\ln d)$.\n12:   Return $S''$ = the set of samples $x \\in S'$ with $|v^* \\cdot (F(x, q) - q)| \\le T + \\delta$.\n\nAt a high level, Algorithm 1 computes a matrix M, and shows that: either $\\|M\\|_2$ is small, and we can output the empirical conditional probabilities, or $\\|M\\|_2$ is large, and we can use the top eigenvector of M to remove bad samples.\n\nSetup and Structural Lemmas. In order to understand the second-moment matrix with zeros on the diagonal, M, we will need to break down this matrix in terms of several related matrices, where the expectation is taken over different sets. 
For a set $D = S'$, S, E or L, we use $w_D = |D|/|S'|$ to denote the fraction of the samples in D. Moreover, we use $M_D = \\mathbb{E}_{X \\in_u D}[(F(X, q) - q)(F(X, q) - q)^T]$ to denote the second-moment matrix of samples in D, and let $M_{D,0}$ be the matrix we get from zeroing out the diagonal of $M_D$. Under this notation, we have $M_{S'} = w_S M_S + w_E M_E - w_L M_L$ and $M = M_{S',0}$.\n\nOur first step is to analyze the spectrum of M, and in particular show that M is close in spectral norm to $w_E M_E$. To do this, we begin by showing that the spectral norm of $M_{S,0}$ is relatively small. Since S is good, we have bounds on the second moments of $F(X, p)$. We just need to deal with the error from replacing p with q:\n\nLemma 10. $\\|M_{S,0}\\|_2 \\le O\\big(\\epsilon + \\sqrt{\\sum_k \\Pr_S[\\Pi_k](p_k - q_k)^2} + \\sum_k \\Pr_S[\\Pi_k](p_k - q_k)^2\\big)$.\n\nNext, we wish to bound the contribution to M coming from the subtractive error. We show that this is small due to concentration bounds on P and hence on S. The idea is that for any unit vector v, we have tail bounds for the random variable $v \\cdot (F(X, q) - q)$ and, since L is a subset of S, L can at worst consist of a small fraction of the tail of this distribution.\n\nLemma 11. 
$w_L \\|M_L\\|_2 \\le O(\\epsilon \\log(1/\\epsilon) + \\epsilon \\|p - q\\|_2^2)$.\n\nFinally, combining the above results, since $M_S$ and $M_L$ have small contribution to the spectral norm of M when $\\|p - q\\|_2$ is small, most of it must come from $M_E$.\n\nLemma 12. $\\|M - w_E M_E\\|_2 \\le O\\big(\\epsilon \\log(1/\\epsilon) + \\sqrt{\\sum_k \\Pr_{S'}[\\Pi_k](p_k - q_k)^2} + \\sum_k \\Pr_{S'}[\\Pi_k](p_k - q_k)^2\\big)$.\n\nLemma 12 follows using the identity $|S'| M = |S| M_{S,0} + |E| M_{E,0} - |L| M_{L,0}$ and bounding the errors due to the diagonals of $M_E$ and $M_L$.\n\n\u00b2 We use $X \\in_u S$ to denote that the point X is drawn uniformly from the set of samples S.\n\nThe Case of Small Spectral Norm. In this section, we will prove that if $\\|M\\|_2 = O(\\epsilon \\log(1/\\epsilon)/\\alpha)$, then we can output the empirical conditional means q. Recall that $M_{S'}$ has entries $(M_{S'})_{i,j} = \\mathbb{E}_{X \\in_u S'}[(F(X, q)_i - q_i)(F(X, q)_j - q_j)]$ and $M = M_{S',0}$.\n\nWe first show that the contributions that L and E make to $\\mathbb{E}_{X \\in_u S'}[F(X, q) - q]$ can be bounded in terms of the spectral norms of $M_L$ and $M_E$. It follows from the Cauchy-Schwarz inequality that:\n\nLemma 13. $\\|\\mathbb{E}_{X \\in_u L}[F(X, q) - q]\\|_2 \\le \\sqrt{\\|M_L\\|_2}$ and $\\|\\mathbb{E}_{X \\in_u E}[F(X, q) - q]\\|_2 \\le \\sqrt{\\|M_E\\|_2}$.\n\nCombining with the results about these norms in Section 3.2, Lemma 13 implies that if $\\|M\\|_2$ is small, then $q = \\mathbb{E}_{X \\in_u S'}[F(X, q)]$ is close to $\\mathbb{E}_{X \\in_u S}[F(X, q)]$, which is then necessarily close to $\\mathbb{E}_{X \\sim P}[F(X, p)] = p$. The following lemma states that the mean of $(F(X, q) - q)$ under the good samples is close to $(p - q)$ scaled by the probabilities of parental configurations under $S'$:\n\nLemma 14. Let $z \\in \\mathbb{R}^m$ be the vector with $z_k = \\Pr_{S'}[\\Pi_k](p_k - q_k)$. 
Then $\\|\\mathbb{E}_{X \\in_u S}[F(X, q) - q] - z\\|_2 \\le O(\\epsilon(1 + \\|p - q\\|_2))$.\n\nNote that z is closely related to the total variation distance between P and Q (the Bayes net with conditional probabilities q). We can write $(\\mathbb{E}_{X \\in_u S'}[F(X, q)] - q)$ in terms of this expectation under S, E, and L whose distance from q can be upper bounded using the previous lemmas. Using Lemmas 11, 12, 13, and 14, we can bound $\\|z\\|_2$ in terms of $\\|M\\|_2$:\n\nLemma 15. $\\sqrt{\\sum_k \\Pr_{S'}[\\Pi_k]^2 (p_k - q_k)^2} \\le 2\\sqrt{\\epsilon \\|M\\|_2} + O(\\epsilon \\sqrt{\\log(1/\\epsilon) + 1/\\alpha})$.\n\nLemma 15 implies that, if $\\|M\\|_2$ is small then so is $\\sqrt{\\sum_k \\Pr_{S'}[\\Pi_k]^2 (p_k - q_k)^2}$. We can then use it to show that $\\sqrt{\\sum_k \\Pr_P[\\Pi_k](p_k - q_k)^2}$ is small. We can do so by losing a factor of $1/\\sqrt{\\alpha}$ to remove the square on $\\Pr_{S'}[\\Pi_k]$, and showing that $\\min_k \\Pr_{S'}[\\Pi_k] = \\Theta(\\min_k \\Pr_P[\\Pi_k])$ when it is at least a large multiple of $\\epsilon$. Finally, if $\\sqrt{\\sum_k \\Pr_P[\\Pi_k](p_k - q_k)^2}$ is small, Lemma 5 tells us that $d_{\\mathrm{TV}}(P, Q)$ is small. This completes the proof of the first case of Proposition 9.\n\nCorollary 16 (Part (i) of Proposition 9). If $\\|M\\|_2 \\le O(\\epsilon \\log(1/\\epsilon)/\\alpha)$, then $d_{\\mathrm{TV}}(P, Q) = O(\\epsilon \\sqrt{\\log(1/\\epsilon)}/(c \\min_k \\Pr_P[\\Pi_k]))$.\n\nThe Case of Large Spectral Norm. Now we consider the case when $\\|M\\|_2 \\ge C \\epsilon \\ln(1/\\epsilon)/\\alpha$. We begin by showing that p and q are not too far apart from each other. The bound given by Lemma 15 is now dominated by the $\\|M\\|_2$ term. 
Lower bounding the $\\Pr_{S'}[\\Pi_k]$ by $\\alpha$ gives the following claim.\n\nClaim 17. $\\|p - q\\|_2 \\le \\delta := 3\\sqrt{\\epsilon \\|M\\|_2}/\\alpha$.\n\nRecall that $v^*$ is the eigenvector of M associated with its largest (in absolute value) eigenvalue. We project all the points $F(X, q)$ onto the direction of $v^*$. Next we show that most of the variance of $(v^* \\cdot (F(X, q) - q))$ comes from E.\n\nClaim 18. $v^{*T}(w_E M_E)v^* \\ge \\frac{1}{2} v^{*T} M v^*$.\n\nClaim 18 follows from the observation that $\\|M - w_E M_E\\|_2$ is much smaller than $\\|M\\|_2$. This is obtained by substituting the bound on $\\|p - q\\|_2$ (in terms of $\\|M\\|_2$) from Claim 17 into the bound on $\\|M - w_E M_E\\|_2$ given by Lemma 12.\n\nClaim 18 implies that the tails of $w_E E$ are reasonably thick. In particular, we show that there must be a threshold $T > 0$ satisfying the desired property in Step 11 of our algorithm.\n\nLemma 19. There exists a $T \\ge 0$ such that $\\Pr_{X \\in_u S'}[|v^* \\cdot (F(X, q) - q)| > T + \\delta] > 7 \\exp(-T^2/2) + 3\\epsilon/(T^2 \\ln d)$.\n\nIf Lemma 19 were not true, by integrating this tail bound, we can show that $v^{*T} M_E v^*$ would be small. Therefore, Step 11 of Algorithm 1 is guaranteed to find some valid threshold $T > 0$.\n\nFinally, we show that the set of samples $S''$ we return after the filter is better than $S'$ in terms of $|L| + |E|$. This completes the proof of the second case of Proposition 9.\n\nClaim 20 (Part (ii) of Proposition 9). If we write $S'' = S \\cup E' \\setminus L'$, then $|E'| + |L'| < |E| + |L|$ and $|S''| \\le (1 - \\frac{\\epsilon}{d \\ln d})|S'|$.\n\nClaim 20 follows from the fact that S is $\\epsilon$-good, so we only remove at most $(3 \\exp(-T^2/2) + \\epsilon/(T^2 \\log d))|S|$ samples from S. Since we remove more than twice as many samples from $S'$, most of the samples we throw away are from E. 
Moreover, we remove at least (1 \u2212 \u0001\nd ln d )|S(cid:48)|\nsamples because we can show that the threshold T is at most\n\n\u221a\n\nd.\n\neach iteration, we implement matrix-vector multiplication with M by writing M v as(cid:80)\n\nRunning Time of Our Algorithm 1 First, q and \u03b1 can be computed in time O(N d) because each\nsample only affects d entries of q. We do not explicitly write down F (X, q) or M. Then, we use\nthe power method to compute the largest eigenvalue \u03bb\u2217 of M and the associated eigenvector v\u2217. In\ni((F (xi, q)\u2212\nq)T v)(F (xi, q)\u2212 q) for any vector v \u2208 Rm. Because each (F (xi, q)\u2212 q) is d-sparse, computing M v\ntakes time O(dN ). The power method takes (log m/\u0001(cid:48)) iterations to \ufb01nd a (1 \u2212 \u0001(cid:48))-approximately\nlargest eigenvalue. We can set \u0001(cid:48) to a small constant, because we can tolerate a small multiplicative\nerror in estimating the spectral norm of M and we only need an approximate top eigenvector (see, e.g.,\nCorollary 16 and Lemma 18). Thus, the power method takes time O(dN log m). Finally, computing\n|v\u2217 \u00b7 (F (x, q) \u2212 q)| takes time O(dN ), then we can sort the samples and \ufb01nd a threshold T in time\nO(N log N ), and throw out the samples in time O(N ).\n\n4 Experiments\n\nWe test our algorithms using data generated from both synthetic and real-world networks (e.g., the\nALARM network [BSCC89]) with synthetic noise. All experiments were run on a laptop with 2.6\nGHz CPU and 8 GB of RAM. We found that our algorithm achieves the smallest error consistently in\nall trials, and that the error of our algorithm almost matches the error of the empirical conditional\nprobabilities of the uncorrupted samples. Moreover, our algorithm can easily scale to thousands of\ndimensions with millions of samples. 3\n\n4.1 Synthetic Experiments\n\nThe results of our synthetic experiments are shown in Figure 1. 
In the synthetic experiment, we\nset $\\epsilon = 0.1$ and first generate a Bayes net $P$ with $100 \\le m \\le 1000$ parameters. We then generate\n$N = 10m/\\epsilon^2$ samples, where a $(1 - \\epsilon)$-fraction of the samples come from the ground truth $P$, and\nthe remaining $\\epsilon$-fraction come from a noise distribution. The goal is to output a Bayes net $Q$ that\nminimizes $d_{TV}(P, Q)$.\n\nFigure 1: Experiments with synthetic data: error is reported against the size of the conditional\nprobability table (lower is better). The error is the estimated total variation distance to the ground\ntruth Bayes net. We use the error of MLE without noise as our benchmark. We plot the performance\nof our algorithm (Filtering), empirical mean with noise (MLE), and RANSAC. We report two settings:\nthe underlying structure of the Bayes net is a random tree (left) or a random graph (right).\n\nWe draw the parameters of $P$ independently and uniformly at random from $[0, 1/4] \\cup [3/4, 1]$, i.e., in a\nsetting where the \u201cbalancedness\u201d assumption does not hold. Our experiments show that our filtering\nalgorithm works very well in this setting, even when the assumptions under which we can prove\ntheoretical guarantees are not satisfied. 
This complements our theoretical results and illustrates that\nour algorithm is not limited by these assumptions and can apply to more general settings in practice.\n\n3 The bottleneck of our algorithm is fitting millions of samples with thousands of dimensions in memory.\n\n[Figure 1 plots (two panels): x-axis: number of parameters ($m$); y-axis: estimated $d_{TV}$; series: MLE w/o noise, Filtering, MLE w/ noise, RANSAC.]\n\nIn Figure 1, we compare the performance of (1) our filtering algorithm, (2) the empirical conditional\nprobability table with noise, and (3) a RANSAC-based algorithm (see the end of Section 4 for a\ndetailed description). We use the error of the empirical conditional mean without noise (i.e., the MLE\nestimator with only good samples) as the gold standard, since this is the best one could hope for even\nif all the corrupted samples are identified. We tried various graph structures for the Bayes net $P$ and\nnoise distributions, and similar patterns arise for all of them. In the left panel, the dependency graph\nof $P$ is a randomly generated tree, and the noise distribution is a binary product distribution; in the\nright panel, the dependency graph of $P$ is a random graph, and the noise distribution is the tree\nBayes net used as the ground truth in the first experiment.\n\n4.2 Semi-Synthetic Experiments\n\nIn the semi-synthetic experiments, we apply our algorithm to robustly learn real-world Bayesian\nnetworks. 
The ALARM network [BSCC89] is a classic Bayes net that implements a medical\ndiagnostic system for patient monitoring.\nOur experimental setup is as follows: The underlying graph of ALARM has 37 nodes and 509\nparameters. Since the variables in ALARM can have up to 4 different values, we first transform it\ninto an equivalent binary-valued Bayes net. After the transformation, the network has $d = 61$ nodes\nand $m = 820$ parameters. We are interested in whether our filtering algorithm can learn a Bayes net\nthat is \u201cclose\u201d to ALARM when samples are corrupted, and in how many corrupted samples our\nalgorithm can tolerate. For each $\\epsilon \\in \\{0.05, 0.1, \\ldots, 0.4\\}$, we draw $N = 10^6$ samples, where a $(1 - \\epsilon)$-fraction\nof the samples come from ALARM, and the other $\\epsilon$-fraction come from a noise distribution.\n\nFigure 2: Experiments with semi-synthetic data: error is reported against the fraction of corrupted\nsamples (lower is better). The error is the estimated total variation distance to the ALARM network.\nWe use the sampling error without noise as a benchmark, and compare the performance of\nour algorithm (Filtering), empirical mean with noise (MLE), and RANSAC.\n\n[Figure 2 plot: x-axis: fraction of corrupted samples ($\\epsilon$); y-axis: estimated $d_{TV}$; series: MLE w/o noise, Filtering, MLE w/ noise, RANSAC.]\n\nIn Figure 2, we compare the performance of (1) our filtering algorithm, (2) the empirical conditional\nmeans with noise, and (3) a RANSAC-based algorithm. We use the error of the empirical conditional\nmeans without noise as the gold standard. We tried various noise distributions and observed similar\npatterns. In Figure 2, the noise distribution is a Bayes net with a random dependency graph and\nconditional probabilities drawn from $[0, 1/4] \\cup [3/4, 1]$ (same as the ground-truth Bayes net in Figure 1).\nThe experiments show that our filtering algorithm outperforms MLE and RANSAC, and that the error of\nour algorithm degrades gracefully as $\\epsilon$ increases. It is worth noting that even though the ALARM network\ndoes not satisfy our balancedness assumption on the parameters, our algorithm still performs well on\nit and recovers the conditional probability table of ALARM in the presence of corrupted samples.\n\n5 Conclusions and Future Directions\n\nIn this paper, we initiated the study of efficient robust learning of graphical models. We described\na computationally efficient algorithm for robustly learning Bayesian networks with a known topology,\nunder some mild assumptions on the conditional probability table. We evaluated our algorithm\nexperimentally, and we view our experiments as a proof-of-concept demonstration that our techniques\ncan be practical for learning fixed-structure Bayesian networks. A challenging open problem is to\ngeneralize our results to the case when the underlying directed graph is unknown.\nThis work is part of a broader agenda of systematically investigating the robust learnability of high-\ndimensional structured probability distributions. There is a wealth of natural probabilistic models that\nmerit investigation in the robust setting, including undirected graphical models (e.g., Ising models),\nand graphical models with hidden variables (i.e., incorporating latent structure).\n\nAcknowledgements. We are grateful to Daniel Hsu for suggesting the model of Bayes nets, and for\npointing us to [Das97]. Yu Cheng is supported in part by NSF CCF-1527084, CCF-1535972, CCF-\n1637397, CCF-1704656, IIS-1447554, and NSF CAREER Award CCF-1750140. Ilias Diakonikolas\nis supported by NSF CAREER Award CCF-1652862 and a Sloan Research Fellowship. 
Daniel Kane\nis supported by NSF CAREER Award CCF-1553288 and a Sloan Research Fellowship.\n\nReferences\n\n[ADLS16] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Fast algorithms for segmented\nregression. In Proceedings of the 33rd International Conference on Machine Learning,\nICML 2016, pages 2878\u20132886, 2016.\n\n[ADLS17] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in\nnearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium\non Discrete Algorithms, SODA 2017, pages 1278\u20131289, 2017.\n\n[AHHK12] A. Anandkumar, D. J. Hsu, F. Huang, and S. Kakade. Learning mixtures of tree\ngraphical models. In Proc. 27th Annual Conference on Neural Information Processing\nSystems (NIPS), pages 1061\u20131069, 2012.\n\n[AKN06] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and\nsample complexity. J. Mach. Learn. Res., 7:1743\u20131788, 2006.\n\n[BDLS17] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse\nestimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory\n(COLT), pages 169\u2013212, 2017.\n\n[Ber06] T. Bernholt. Robust estimators are hard to compute. Technical report, University of\nDortmund, Germany, 2006.\n\n[BGS14] G. Bresler, D. Gamarnik, and D. Shah. Structure learning of antiferromagnetic Ising\nmodels. In NIPS, pages 2852\u20132860, 2014.\n\n[BMS13] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from\nsamples: Some observations and algorithms. SIAM J. Comput., 42(2):563\u2013578, 2013.\n\n[Bre15] G. Bresler. Efficiently learning Ising models on arbitrary graphs. In Proc. 47th Annual\nACM Symposium on Theory of Computing (STOC), pages 771\u2013782, 2015.\n\n[BSCC89] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. The ALARM\nMonitoring System: A Case Study with two Probabilistic Inference Techniques for\nBelief Networks. 
Springer, 1989.\n\n[CDKS17] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing Bayesian networks.\nIn Proc. 30th Annual Conference on Learning Theory (COLT), pages 370\u2013448, 2017.\n\n[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured\ndistributions over discrete domains. In Proc. 24th Annual Symposium on Discrete\nAlgorithms (SODA), pages 1380\u20131394, 2013.\n\n[CDSS14a] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via\npiecewise polynomial approximation. In Proc. 46th Annual ACM Symposium on Theory\nof Computing (STOC), pages 604\u2013613, 2014.\n\n[CDSS14b] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in\nnear-linear time using variable-width histograms. In Proc. 29th Annual Conference on\nNeural Information Processing Systems (NIPS), pages 1844\u20131852, 2014.\n\n[CGR15] M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under\nHuber\u2019s contamination model. CoRR, abs/1506.00691, 2015.\n\n[CGR16] M. Chen, C. Gao, and Z. Ren. A general decision theory for Huber\u2019s $\\epsilon$-contamination\nmodel. Electronic Journal of Statistics, 10(2):3752\u20133774, 2016.\n\n[CL68] C. Chow and C. Liu. Approximating discrete probability distributions with dependence\ntrees. IEEE Trans. Inf. Theor., 14(3):462\u2013467, 1968.\n\n[Das97] S. Dasgupta. The sample complexity of learning fixed-structure Bayesian networks.\nMachine Learning, 29(2-3):165\u2013180, 1997.\n\n[DDS14] C. Daskalakis, I. Diakonikolas, and R. A. Servedio. Learning k-modal distributions via\ntesting. Theory of Computing, 10(20):535\u2013570, 2014.\n\n[DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust\nestimators in high dimensions without the computational intractability. In Proc. 57th\nIEEE Symposium on Foundations of Computer Science (FOCS), 2016.\n\n[DKK+17] I. Diakonikolas, G. 
Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being\nrobust (in high dimensions) can be practical. In Proc. 34th International Conference on\nMachine Learning (ICML), pages 999\u20131008, 2017.\n\n[DKK+18a] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly\nlearning a Gaussian: Getting optimal error, efficiently. In Proc. 29th ACM-SIAM\nSymposium on Discrete Algorithms (SODA), 2018.\n\n[DKK+18b] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A\nrobust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815,\n2018.\n\n[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust\nestimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE\nSymposium on Foundations of Computer Science (FOCS), pages 73\u201384, 2017.\n\n[DKS18a] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty\nnoise. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages\n1061\u20131073, 2018.\n\n[DKS18b] I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation\nand learning mixtures of spherical Gaussians. In Proc. 50th Annual ACM Symposium\non Theory of Computing (STOC), pages 1047\u20131060, 2018.\n\n[DKS18c] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for\nrobust linear regression. CoRR, abs/1806.00040, 2018.\n\n[DLS18] I. Diakonikolas, J. Li, and L. Schmidt. Fast and sample near-optimal algorithms for\nlearning multidimensional histograms. In Conference On Learning Theory, COLT 2018,\npages 819\u2013842, 2018.\n\n[DSA11] R. Daly, Q. Shen, and S. Aitken. Learning Bayesian networks: approaches and issues.\nThe Knowledge Engineering Review, 26:99\u2013157, 2011.\n\n[HL18] S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. 
In Proc.\n50th Annual ACM Symposium on Theory of Computing (STOC), pages 1021\u20131034,\n2018.\n\n[HM13] M. Hardt and A. Moitra. Algorithms and hardness for robust subspace recovery. In\nProc. 26th Annual Conference on Learning Theory (COLT), pages 354\u2013375, 2013.\n\n[HR09] P. J. Huber and E. M. Ronchetti. Robust statistics. Wiley New York, 2009.\n\n[HRRS86] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust statistics:\nThe approach based on influence functions. Wiley New York, 1986.\n\n[Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73\u2013101,\n03 1964.\n\n[JN07] F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer\nPublishing Company, Incorporated, 2nd edition, 2007.\n\n[JP78] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical\nComputer Science, 6:93\u2013107, 1978.\n\n[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques\n- Adaptive Computation and Machine Learning. The MIT Press, 2009.\n\n[KKM18] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression.\nIn Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420\u20131430, 2018.\n\n[KSS18] P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved\nclustering via sum of squares. In Proc. 50th Annual ACM Symposium on Theory of\nComputing (STOC), pages 1035\u20131046, 2018.\n\n[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In\nProc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), 2016.\n\n[LSLC18] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression.\nCoRR, abs/1805.11643, 2018.\n\n[LW12] P. L. Loh and M. J. Wainwright. Structure estimation for discrete graphical models:\nGeneralized covariance matrices and their inverses. 
In NIPS, pages 2096\u20132104, 2012.\n\n[Nea03] R. E. Neapolitan. Learning Bayesian Networks. Prentice-Hall, Inc., 2003.\n\n[PSBR18] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via\nrobust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.\n\n[SW12] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary\ngraphical models in high dimensions. IEEE Trans. Information Theory, 58(7):4117\u20134134,\n2012.\n\n[Tuk75] J. W. Tukey. Mathematics and the picturing of data. In Proceedings of the International\nCongress of Mathematicians, volume 6, pages 523\u2013531, 1975.\n\n[WJ08] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and\nvariational inference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, 2008.\n\n[WRL06] M. J. Wainwright, P. Ravikumar, and J. D. Lafferty. High-dimensional graphical model\nselection using $\\ell_1$-regularized logistic regression. In Proc. 20th Annual Conference on\nNeural Information Processing Systems (NIPS), pages 1465\u20131472, 2006.", "award": [], "sourceid": 6591, "authors": [{"given_name": "Yu", "family_name": "Cheng", "institution": "Duke University"}, {"given_name": "Ilias", "family_name": "Diakonikolas", "institution": "University of Southern California"}, {"given_name": "Daniel", "family_name": "Kane", "institution": "UCSD"}, {"given_name": "Alistair", "family_name": "Stewart", "institution": "University of Southern California"}]}