{"title": "Fast Variational Inference for Large-scale Internet Diagnosis", "book": "Advances in Neural Information Processing Systems", "page_first": 1169, "page_last": 1176, "abstract": "Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a variational approximation, a mean-field approximation, and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours.", "full_text": "Fast Variational Inference for Large-scale Internet Diagnosis\n\nJohn C. Platt, Emre K\u0131c\u0131man, David A. Maltz\n\nMicrosoft Research\n1 Microsoft Way\nRedmond, WA 98052\n{jplatt,emrek,dmaltz}@microsoft.com\n\nAbstract\n\nWeb servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service.
The inference is fast enough to analyze network logs with billions of entries in a matter of hours.\n\n1 Introduction\n\nInternet content providers, such as MSN, Google, and Yahoo, all depend on the correct functioning of the wide-area Internet to communicate with their users and provide their services. When these content providers lose network connectivity with some of their users, it is critical that they quickly resolve the problem, even if the failure lies outside their own systems.^1 One challenge is that content providers have little direct visibility into the wide-area Internet infrastructure and the causes of user request failures. Requests may fail because of problems in the content provider's systems or faults in the network infrastructure anywhere between the user and the content provider, including routers, proxies, firewalls, and DNS servers. Other failing requests may be due to denial-of-service attacks or bugs in the user's software. To compound the diagnosis problem, these faults may be intermittent: we must use probabilistic inference to perform diagnosis, rather than logic.\n\nA second challenge is the scale involved. Not only do popular Internet content providers receive billions of HTTP requests a week, but the number of potential causes of failure is also large. Counting only the coarse-grained Autonomous Systems (ASes) through which users receive Internet connectivity, there are over 20k potential causes of failure. In this paper, we show that approximate Bayesian inference scales to handle this high rate of observations and accurately estimates the underlying failure rates of such a large number of potential causes of failure.\n\nTo scale Bayesian inference to Internet-sized problems, we must make several simplifying approximations. First, we introduce a bipartite graphical model using overlapping noisy-ORs to model the interactions between faults and observations.
Second, we use mean-field variational inference to map the diagnosis problem to a reasonably-sized optimization problem. Third, we further approximate the integral in the variational method. Fourth, we speed up the optimization problem using stochastic gradient descent.\n\n^1 A loss of connectivity to users translates directly into lost revenue and a sullied reputation for content providers, even if the cause of the problem is a third-party network component.\n\nThe paper is structured as follows: Section 1.1 discusses related work. We describe the graphical model in Section 2, and the approximate inference in that model in Section 2.1, including stochastic gradient descent (in Section 3). We present inference results on synthetic and real data in Section 4 and then draw conclusions.\n\n1.1 Previous Work\n\nThe original application of Bayesian diagnosis was medicine. One of the original diagnosis networks was QMR-DT [14], a bipartite graphical model that used noisy-OR to model symptoms given diseases. Exact inference in such networks is intractable (exponential in the number of positive symptoms [2]), so different approximation and sampling algorithms were proposed. Shwe and Cooper proposed likelihood-weighted sampling [13], while Jaakkola and Jordan proposed using a variational approximation to unlink each input to the network [3]. With only thousands of possible symptoms and hundreds of diseases, QMR-DT was considered very challenging.\n\nMore recently, researchers have applied Bayesian techniques to the diagnosis of computers and networks [1][12][16]. This work has tended to avoid inference in large networks, due to speed constraints. In contrast, we attack the enormous inference problem directly.\n\n2 Graphical model of diagnosis\n\nFigure 1: The full graphical model for the diagnosis of Internet faults (Beta fault rates feeding Bernoulli fault variables, combined by a noisy-OR)\n\nThe initial graphical model for diagnosis is shown in Figure 1.
Starting at the bottom, we observe a large number of binary random variables, each corresponding to the success/failure of a single HTTP request. The failure of an HTTP request can be modeled as a noisy-OR [11] of a set of Bernoulli-distributed binary variables, each of which models the underlying factors that can cause a request to fail:\n\nP(V_i = fail | D_ij) = 1 - (1 - r_i0) Π_j (1 - r_ij d_ij),    (1)\n\nwhere r_ij is the probability that the observation is a failure if a single underlying fault d_ij is present. The matrix r_ij is typically very sparse, because there are only a small number of possible causes for the failure of any request. The r_i0 parameter models the probability of a spontaneous failure without any known cause. The r_ij are set by elicitation of probabilities from an expert.\n\nThe noisy-OR models the causal structure in the network, and its connections are derivable from the metadata associated with the HTTP request. For example, a single request can fail because its server has failed, because a misconfigured or overloaded router has caused an AS to lose connectivity to the content provider, or because the user agent is not compatible with the service. All of these underlying causes are modeled independently for each request, because possible faults in the system can be intermittent.\n\nFigure 2: Graphical model after integrating out instantaneous faults: a bipartite noisy-OR network with Beta distributions as hidden variables\n\nEach of the Bernoulli variables D_ij depends on an underlying continuous fault rate variable F_j in [0, 1]:\n\nP(D_ij | F_j = μ_j) = μ_j^(d_ij) (1 - μ_j)^(1 - d_ij),    (2)\n\nwhere μ_j is the probability of a fault manifesting at any time.
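As a concrete aside, the noisy-OR likelihood (1) is cheap to evaluate; the following is a minimal Python sketch (illustrative only, not the authors' implementation; the names `r0`, `r`, and `d` are our own):

```python
def p_fail(r0, r, d):
    """Noisy-OR of equation (1): probability that an observation fails,
    given the leak probability r0, per-cause strengths r[j], and
    binary fault indicators d[j]. Names are illustrative."""
    p_ok = 1.0 - r0
    for rj, dj in zip(r, d):
        p_ok *= 1.0 - rj * dj
    return 1.0 - p_ok

# With no faults present, only the leak term r0 remains.
print(p_fail(0.01, [0.5, 0.2], [0, 0]))  # approximately 0.01
```

The product form makes the sparsity of r_ij pay off directly: only the small set of causes connected to a request contributes factors.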
We model the F_j as independent Beta distributions, one for each fault:\n\np(F_j = μ_j) = [1 / B(α0_j, β0_j)] μ_j^(α0_j - 1) (1 - μ_j)^(β0_j - 1),    (3)\n\nwhere B is the beta function. The fan-out for each of these fault rates can be different: some of these fault rates are connected to many observations, while less common ones are connected to fewer.\n\nOur goal is to model the posterior distribution P(F | V) in order to identify hidden faults and track them through time. The existence of the D_ij random variables is a nuisance. We do not want to estimate P(D | V) for any D_ij: the distribution of instantaneous problems is not interesting. Fortunately, we can exactly integrate out these nuisance variables, because each is connected to only one observation through a noisy-OR.\n\nAfter integrating out the D_ij, the graphical model is shown in Figure 2. The model is now completely analogous to the QMR-DT model [14], but instead of the noisy-OR combining binary random variables, it combines rate variables:\n\nP(V_i = fail | F_j = μ_j) = 1 - (1 - r_i0) Π_j (1 - r_ij μ_j).    (4)\n\nOne can view (4) as a generalization of a noisy-OR to continuous [0, 1] variables.\n\n2.1 Approximations to make inference tractable\n\nIn order to scale inference up to 10^4 hidden variables and 10^5 observations, we choose a simple, robust approximate inference algorithm: mean-field variational inference [4].
Mean-field variational inference approximates the posterior P(F | V) with a factorized distribution. For inferring fault rates, we choose to approximate P with a product of Beta distributions:\n\nQ(F | V) = Π_j q(F_j | V) = Π_j [1 / B(α_j, β_j)] μ_j^(α_j - 1) (1 - μ_j)^(β_j - 1).    (5)\n\nMean-field variational inference maximizes a lower bound on the evidence of the model:\n\nmax_{α,β} L = ∫ Q(μ | V) log [ P(V | μ) p(μ) / Q(μ | V) ] dμ.    (6)\n\nThis integral can be broken into two terms: a cross-entropy between the approximate posterior and the prior, and an expected log-likelihood of the observations:\n\nmax_{α,β} L = - ∫ Q(μ | V) log [ Q(μ | V) / p(μ) ] dμ + ⟨ log P(V | F) ⟩_Q.    (7)\n\nThe first integral is the negative of a sum of KL divergences between Beta distributions, each with the closed form\n\nD_KL(q_j || p_j) = log [ B(α0_j, β0_j) / B(α_j, β_j) ] + (α_j - α0_j) ψ(α_j) + (β_j - β0_j) ψ(β_j) - (α_j + β_j - α0_j - β0_j) ψ(α_j + β_j),    (8)\n\nwhere ψ is the digamma function.\n\nHowever, the expected log likelihood of a noisy-OR integrated over a product of Beta distributions does not have an analytic form. Therefore, we employ the MF(0) approximation of Ng and Jordan [9], replacing the expectation of the log likelihood with the log likelihood of the expectation.
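The closed-form KL term (8) can be computed with only the standard library; a sketch (illustrative, not the authors' C# code; the digamma routine is a standard recurrence-plus-asymptotic-series approximation, our own assumption rather than anything in the paper):

```python
import math

def digamma(x):
    """Approximate the digamma function psi(x) for x > 0, via the
    recurrence psi(x) = psi(x+1) - 1/x and an asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (
        1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a, b, a0, b0):
    """KL(Beta(a, b) || Beta(a0, b0)), the closed form of equation (8)."""
    return (log_beta(a0, b0) - log_beta(a, b)
            + (a - a0) * digamma(a)
            + (b - b0) * digamma(b)
            - (a + b - a0 - b0) * digamma(a + b))
```

For example, kl_beta(2, 1, 1, 1) equals log 2 - 1/2, matching the direct integral; with SciPy available, scipy.special.psi and scipy.special.betaln could replace the two helpers.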
The second term then becomes the sum of a set of log likelihoods, one per observation:\n\nL(V_i) = log( 1 - (1 - r_i0) Π_j [1 - r_ij α_j/(α_j + β_j)] )  if V_i = 1 (failure);\nL(V_i) = log(1 - r_i0) + Σ_j log[1 - r_ij α_j/(α_j + β_j)]  if V_i = 0 (success).    (9)\n\nFor the Internet diagnosis case, the MF(0) approximation is reasonable: we expect the posterior distribution to be concentrated around its mean, due to the large amount of data that is available. Ng and Jordan [9] have proved accuracy bounds for MF(0) based on the number of parents that an observation has.\n\nThe final cost function for a minimization routine then becomes\n\nmin_{α,β} C = Σ_j D_KL(q_j || p_j) - Σ_i L(V_i).    (10)\n\n3 Variational inference by stochastic gradient descent\n\nIn order to apply unconstrained optimization algorithms to minimize (10), we need to transform the variables: only positive α_j and β_j are valid, so we parameterize them by\n\nα_j = e^(a_j),  β_j = e^(b_j),    (11)\n\nand the gradient computation becomes\n\n∂C/∂a_j = α_j ( ∂D_KL(q_j || p_j)/∂α_j - Σ_i ∂L(V_i)/∂α_j ),    (12)\n\nwith a similar gradient for b_j. Note that this gradient computation can be quite computationally expensive, given that i sums over all of the observations.\n\nFor Internet diagnosis, we can decompose the observation stream into blocks, where the size of a block is determined by how quickly the underlying rates of faults change, and how finely we want to sample those rates. We typically use blocks of 100,000 observations, which can make the computation of the gradient expensive.
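Each term of that sum, the MF(0) log likelihood (9), can be sketched as follows (illustrative Python, not the authors' C# implementation; the sparse dictionary of parent strengths is our own representation):

```python
import math

def loglik_mf0(v, r0, r, mean_mu):
    """MF(0) per-observation log likelihood, equation (9): the Beta
    means alpha_j/(alpha_j + beta_j) are plugged into the noisy-OR.
    r maps each parent fault j to its strength r_ij (sparse);
    mean_mu maps j to the current posterior mean of F_j."""
    # Log of the success probability: log(1 - r_i0) + sum_j log(1 - r_ij mu_j).
    s = math.log(1.0 - r0) + sum(
        math.log(1.0 - rij * mean_mu[j]) for j, rij in r.items())
    if v == 0:          # success
        return s
    return math.log(1.0 - math.exp(s))  # failure: log(1 - product)
```

A quick sanity check: for fixed parameters, the exponentials of the success and failure log likelihoods sum to one.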
Further, we repeat the inference over and over again, on thousands of blocks of data: we prefer a fast optimization procedure over a highly accurate one.\n\nTherefore, we investigated the use of stochastic gradient descent for optimizing the variational cost function. Stochastic gradient descent approximates the full gradient with a single term from the gradient: the state of the optimization is updated using that single term [5]. This enables the system to converge quickly to an approximate answer. The details of stochastic gradient descent are shown in Algorithm 1.\n\nAlgorithm 1 Variational Gradient Descent\nRequire: Noisy-OR parameters r_ij, priors α0_j, β0_j, observations V_i\n  Initialize a_j = log(α0_j), b_j = log(β0_j)\n  Initialize y_j, z_j to 0\n  for k = 1 to number of epochs do\n    for all faults j do\n      α_j = exp(a_j), β_j = exp(b_j)\n      y_j ← ξ y_j + (1 - ξ) ∂D_KL(q_j || p_j; α_j, β_j)/∂a_j\n      z_j ← ξ z_j + (1 - ξ) ∂D_KL(q_j || p_j; α_j, β_j)/∂b_j\n      a_j ← a_j - η y_j\n      b_j ← b_j - η z_j\n    end for\n    for all observations i do\n      for all parent faults j of observation V_i do\n        α_j = exp(a_j), β_j = exp(b_j)\n      end for\n      for all parent faults j of observation V_i do\n        y_j ← ξ y_j - (1 - ξ) ∂L(V_i; α, β)/∂a_j\n        z_j ← ξ z_j - (1 - ξ) ∂L(V_i; α, β)/∂b_j\n        a_j ← a_j - η y_j\n        b_j ← b_j - η z_j\n      end for\n    end for\n  end for\n\nEstimating the sum in equation (12) with a single term adds a tremendous amount of noise to the estimates. For example, the sign of a single L(V_i) gradient term depends only on the sign of V_i. In order to reduce the noise in the estimate, we use momentum [15]: we exponentially smooth the gradient with a first-order filter before applying it to the state variables.
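The smoothed update applied to each state variable in Algorithm 1 can be sketched as (illustrative Python; the default step size and momentum follow the values quoted in the text):

```python
def sgd_momentum_step(a, y, grad, eta=0.1, xi=0.99):
    """One momentum-smoothed stochastic gradient step, as in Algorithm 1:
    the raw gradient estimate is passed through a first-order filter
    (y <- xi*y + (1-xi)*grad) before the state update (a <- a - eta*y).
    Returns the updated (state, filtered gradient) pair."""
    y = xi * y + (1.0 - xi) * grad
    return a - eta * y, y
```

Because ξ is close to 1, each single-observation gradient moves the filtered estimate only slightly, which is what tames the sign noise described above.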
This momentum modification is shown in Algorithm 1. We typically use a large step size (η = 0.1) and momentum term (ξ = 0.99), in order both to react quickly to changes in the fault rate and to smooth out noise.\n\nStochastic gradient descent can be used as a purely on-line method (where each data point is seen only once), by setting the "number of epochs" in Algorithm 1 to 1. Alternatively, it can achieve higher accuracy if it is allowed to sweep through the data multiple times.\n\n3.1 Other possible approaches\n\nWe considered and tested several other approaches to solving the approximate inference problem.\n\nJaakkola and Jordan propose a variational inference method for bipartite noisy-OR networks [3], where one variational parameter is introduced to unlink one observation from the network. We typically have far more observations than possible faults: this previous approach would have forced us to solve very large optimization problems (with 100,000 parameters). Instead, we solve an optimization whose dimension equals the number of faults.\n\nWe originally optimized the variational cost function (10) with both BFGS and the trust-region algorithm in the Matlab optimization toolbox. Both turned out to be far worse than stochastic gradient descent. We found that a C# implementation of L-BFGS, as described in Nocedal and Wright [10], sped up the exact optimization by orders of magnitude. We report on the L-BFGS performance below: it is within 4x of the speed of stochastic gradient descent.\n\nWe experimented with Metropolis-Hastings to sample from the posterior, using a Gaussian random walk in (a_j, b_j). We found that the burn-in time was very long. Also, each update is slow, because the speed of a single update depends on the fan-out of each fault. In the Internet diagnosis network, the fan-out is quite high (because a single fault affects many observations).
Thus, Metropolis-Hastings was far slower than variational inference.\n\nWe did not try loopy belief propagation [8], nor expectation propagation [6]. Because the Beta distribution is not conjugate to the noisy-OR, the messages passed by either algorithm do not have a closed form.\n\nFinally, we did not try the idea of learning to predict the posterior from the observations by sampling from the generative model and learning the reverse mapping [7]. For Internet diagnosis, we do not know the structure of the graphical model for a block of data ahead of time: the structure depends on the metadata for the requests in the log. Thus, we cannot amortize the learning time of a predictive model.\n\n4 Results\n\nWe test the approximations and optimization methods used for Internet diagnosis on both synthetic and real data.\n\n4.1 Synthetic data with known hidden state\n\nTesting the accuracy of approximate inference is very difficult, because, for large graphical models, the true posterior distribution is intractable. However, we can probe the reliability of the model on a synthetic data set.\n\nWe start by generating fault rates from a prior (here, 2000 faults drawn from Beta(5e-3, 1)). We randomly generate connections from faults to observations, each present with probability 5 × 10^-3. Each connection has a strength r_ij drawn uniformly at random from [0, 1]. We generate 100,000 observations from the noisy-OR model (4). Given these observations, we predict an approximate posterior.\n\nGiven that the number of observations is much larger than the number of faults, we expect that the posterior distribution should tightly cluster around the rates that generated the observations.
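The synthetic benchmark just described can be sketched as follows (an illustrative Python generator under our own assumptions: the spontaneous-failure leak r_i0 is omitted, and the sizes are parameters rather than fixed at 2000 faults and 100,000 observations):

```python
import random

def make_synthetic(n_faults=2000, n_obs=100000, p_conn=5e-3, seed=0):
    """Generate fault rates from a Beta(5e-3, 1) prior, sparse random
    fault-observation connections with strengths uniform in [0, 1],
    and binary failure observations from the noisy-OR model (4)
    (leak term omitted). Returns (true rates, observations)."""
    rng = random.Random(seed)
    mu = [rng.betavariate(5e-3, 1.0) for _ in range(n_faults)]
    obs = []
    for _ in range(n_obs):
        # Each fault is a parent with probability p_conn; strength is U[0,1].
        parents = [(j, rng.random()) for j in range(n_faults)
                   if rng.random() < p_conn]
        p_ok = 1.0
        for j, rij in parents:
            p_ok *= 1.0 - rij * mu[j]
        obs.append((parents, 1 if rng.random() > p_ok else 0))
    return mu, obs
```

With the true rates in hand, the error of any approximate posterior mean can be measured directly, which is what Figure 3 and Table 1 report.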
Di\ufb00erence between the true rate and the mean of the approximate posterior\nshould re\ufb02ect inaccuracies in the estimation.\n\nt\n\ne\na\nm\n\n \n\ni\nt\ns\ne\ne\na\nr\n \nf\n\nt\n\no\n \nr\no\nr\nr\n\nE\n\n0.08\n0.06\n0.04\n0.02\n0\n-0.02\n-0.04\n-0.06\n-0.08\n-0.1\n-0.12\n\n0\n\n0.1\n\n0.2\n\n0.3\n\n0.4\n\n0.5\n\n0.6\n\n0.7\n\n0.8\n\n0.9\n\nFigure 3: The error in estimate of rate versus true underlying rate. Black dots are L-BFGS,\nRed dots are Stochastic Gradient Descent with 20 epochs.\n\nThe results for a run is shown in Figure 3. The \ufb01gure shows that the errors in the estimate\nare small enough to be very useful for understanding network errors. There is a slight\nsystematic bias in the stochastic gradient descent, as compared to L-BFGS. However, the\nimprovement in speed shown in Table 1 is worth the loss of accuracy: we need inference to\n\n6\n\n\fbe as fast as possible to scale to billions of samples. The run times are for a uniprocessor\nPentium 4, 3 GHz, with code in C#.\n\nAlgorithm\n\nL-BFGS\nSGD, 1 epoch\nSGD, 20 epochs\n\nAccuracy\n(RMSE)\n0.0033\n0.0343\n0.0075\n\nTime\n\n(CPU sec)\n\n38\n0.5\n11.7\n\nTable 1: Accuracy and speed on synthetic data set\n\n4.2 Real data from web server logs\n\nWe then tested the algorithm on real data from a major web service. Each observation\nconsists of a success or failure of a single HTTP request. We selected 18848 possible faults\nthat occur frequently in the dataset, including the web server that received the request,\nwhich autonomous system that originated the request, and which \u201cuser agent\u201d (brower or\nrobot) generated the request.\n\nWe have been analyzing HTTP logs collected over several months with the stochastic gra-\ndient descent algorithm. 
In this paper, we present an analysis of a short 2.5-hour window containing an anomalously high rate of failures, in order to demonstrate that our algorithm can help us understand the cause of failures based on observations in a real-world environment.\n\nWe broke the time series of observations into blocks of 100,000 observations, and inferred the hidden rates for each block. The initial state of the optimizer was set to the state of the optimizer at convergence on the previous block. Thus, for stochastic gradient descent, the momentum variables were carried forward from block to block.\n\nFigure 4: The inferred fault rate for two Autonomous Systems, as a function of time (8:00 PM to 11:21 PM). These are the only two faults with high rate.\n\nThe results of this tracking experiment are shown in Figure 4. In this figure, we used stochastic gradient descent and a Beta(0.1, 100) prior. The figure shows the only two faults whose probability went higher than 0.1 in this time interval: they correspond to two ASes in the same city, both causing failures at roughly the same time. This could be due to a router that is shared between them, or perhaps a denial of service attack that originated in that city.\n\nThe speed of the analysis is much faster than real time. For a data set of 10 million samples, L-BFGS required 209 CPU seconds, while SGD (with 3 passes over the data per block) required only 51 seconds. This allows us to go through logs containing billions of entries in a matter of hours.\n\n5 Conclusions\n\nThis paper presents high-speed variational inference to diagnose problems on the scale of the Internet.
Given observations at a web server, the diagnosis can determine whether a web server needs rebooting, whether part of the Internet is broken, or whether the web server is compatible with a browser or user agent.\n\nIn order to scale inference up to Internet-sized diagnosis problems, we make several approximations. First, we use mean-field variational inference to approximate the posterior distribution. Second, the expected log likelihood inside the variational cost function is approximated with the MF(0) approximation. Finally, we use stochastic gradient descent to perform the variational optimization.\n\nWe are currently using variational stochastic gradient descent to analyze logs that contain billions of requests. We are not aware of any other applications of variational inference at this scale. Future publications will include conclusions of such analysis, and implications for web services and the Internet at large.\n\nReferences\n\n[1] M. Chen, A. X. Zheng, J. Lloyd, M. I. Jordan, and E. Brewer. Failure diagnosis using decision trees. In Proc. Int'l. Conf. Autonomic Computing, pages 36–43, 2004.\n[2] D. Heckerman. A tractable inference algorithm for diagnosing multiple diseases. In Proc. UAI, pages 163–172, 1989.\n[3] T. Jaakkola and M. Jordan. Variational probabilistic inference and the QMR-DT database. Journal of Artificial Intelligence Research, 10:291–322, 1999.\n[4] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.\n[5] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, 2003.\n[6] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proc. UAI, pages 362–369, 2001.\n[7] Q. Morris. Recognition networks for approximate inference in BN20 networks.
In Proc. UAI, pages 370–37, 2001.\n[8] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proc. UAI, pages 467–475, 1999.\n[9] A. Y. Ng and M. Jordan. Approximate inference algorithms for two-layer Bayesian networks. In Proc. NIPS, pages 533–539, 1999.\n[10] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.\n[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n[12] I. Rish, M. Brodie, and S. Ma. Accuracy vs. efficiency tradeoffs in probabilistic diagnosis. In Proc. AAAI, pages 560–566, 2001.\n[13] M. A. Shwe and G. F. Cooper. An empirical analysis of likelihood-weighting simulation on a large, multiply-connected medical belief network. Computers and Biomedical Research, 24(5):453–475, 1991.\n[14] M. A. Shwe, B. Middleton, D. E. Heckerman, M. Henrion, E. J. Horvitz, H. P. Lehmann, and G. F. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine, 30(4):241–255, 1991.\n[15] J. J. Shynk and S. Roy. The LMS algorithm with momentum updating. In Proc. Intl. Symp. Circuits and Systems, pages 2651–2654, 1988.\n[16] M. Steinder and A. Sethi. End-to-end service failure diagnosis using belief networks. In Proc. Network Operations and Management Symposium, pages 375–390, 2002.", "award": [], "sourceid": 853, "authors": [{"given_name": "Emre", "family_name": "Kiciman", "institution": null}, {"given_name": "David", "family_name": "Maltz", "institution": null}, {"given_name": "John", "family_name": "Platt", "institution": null}]}