{"title": "A Bayesian Spatial Scan Statistic", "book": "Advances in Neural Information Processing Systems", "page_first": 1003, "page_last": 1010, "abstract": "", "full_text": "A Bayesian Spatial Scan Statistic\n\nDaniel B. Neill\n\nAndrew W. Moore\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\n{neill,awm}@cs.cmu.edu\n\nGregory F. Cooper\n\nCenter for Biomedical Informatics\n\nUniversity of Pittsburgh\nPittsburgh, PA 15213\ngfc@cbmi.pitt.edu\n\nAbstract\n\nWe propose a new Bayesian method for spatial cluster detection, the\n\u201cBayesian spatial scan statistic,\u201d and compare this method to the standard\n(frequentist) scan statistic approach. We demonstrate that the Bayesian\nstatistic has several advantages over the frequentist approach, including\nincreased power to detect clusters and (since randomization testing is\nunnecessary) much faster runtime. We evaluate the Bayesian and fre-\nquentist methods on the task of prospective disease surveillance: detect-\ning spatial clusters of disease cases resulting from emerging disease out-\nbreaks. We demonstrate that our Bayesian methods are successful in\nrapidly detecting outbreaks while keeping number of false positives low.\n\nIntroduction\n\n1\nHere we focus on the task of spatial cluster detection: \ufb01nding spatial regions where some\nquantity is signi\ufb01cantly higher than expected. For example, our goal may be to detect\nclusters of disease cases, which may be indicative of a naturally occurring epidemic (e.g.\nin\ufb02uenza), a bioterrorist attack (e.g. anthrax release), or an environmental hazard (e.g. ra-\ndiation leak). [1] discusses many other applications of cluster detection, including mining\nastronomical data, medical imaging, and military surveillance. 
In all of these applications,\nwe have two main goals: to identify the locations, shapes, and sizes of potential clusters,\nand to determine whether each potential cluster is more likely to be a \u201ctrue\u201d cluster or sim-\nply a chance occurrence. Thus we compare the null hypothesis H0 of no clusters against\nsome set of alternative hypotheses H1(S), each representing a cluster in some region or\nregions S. In the standard frequentist setting, we do this by signi\ufb01cance testing, computing\nthe p-values of potential clusters by randomization; here we propose a Bayesian frame-\nwork, in which we compute posterior probabilities of each potential cluster.\n\nOur primary motivating application is prospective disease surveillance: detecting spatial\nclusters of disease cases resulting from a disease outbreak. In this application, we perform\nsurveillance on a daily basis, with the goal of \ufb01nding emerging epidemics as quickly as\npossible. For this task, we are given the number of cases of some given syndrome type\n(e.g. respiratory) in each spatial location (e.g. zip code) on each day. More precisely, we\ntypically cannot measure the actual number of cases, and instead rely on related observable\nquantities such as the number of Emergency Department visits or over-the-counter drug\nsales. We must then detect those increases which are indicative of emerging outbreaks,\nas close to the start of the outbreak as possible, while keeping the number of false posi-\ntives low. In biosurveillance of disease, every hour of earlier detection can translate into\nthousands of lives saved by more timely administration of antibiotics, and this has led to\nwidespread interest in systems for the rapid and automatic detection of outbreaks.\n\n\fIn this spatial surveillance setting, each day we have data collected for a set of discrete\nspatial locations si. For each location si, we have a count ci (e.g. number of disease cases),\nand an underlying baseline bi. 
The baseline may correspond to the underlying population at risk, or may be an estimate of the expected value of the count (e.g. derived from the time series of previous count data). Our goal, then, is to find if there is any spatial region S (set of locations si) for which the counts are significantly higher than expected, given the baselines. For simplicity, we assume here (as in [2]) that the locations si are aggregated to a uniform, two-dimensional, N \u00d7 N grid G, and we search over the set of rectangular regions S \u2286 G. This allows us to search both compact and elongated regions, allowing detection of elongated disease clusters resulting from dispersal of pathogens by wind or water.\n\n1.1 The frequentist scan statistic\nOne of the most important statistical tools for cluster detection is Kulldorff\u2019s spatial scan statistic [3-4]. This method searches over a given set of spatial regions, finding those regions which maximize a likelihood ratio statistic and thus are most likely to be generated under the alternative hypothesis of clustering rather than the null hypothesis of no clustering. Randomization testing is used to compute the p-value of each detected region, correctly adjusting for multiple hypothesis testing, and thus we can both identify potential clusters and determine whether they are significant. Kulldorff\u2019s framework assumes that counts ci are Poisson distributed with ci \u223c Po(qbi), where bi represents the (known) census population of cell si and q is the (unknown) underlying disease rate. Then the goal of the scan statistic is to find regions where the disease rate is higher inside the region than outside. The statistic used for this is the likelihood ratio $F(S) = \\frac{P(\\mathrm{Data} \\mid H_1(S))}{P(\\mathrm{Data} \\mid H_0)}$, where the null hypothesis H0 assumes a uniform disease rate q = qall. 
Under H1(S), we assume that q = qin for all si \u2208 S, and q = qout for all si \u2208 G \u2212 S, for some constants qin > qout. From this, we can derive an expression for F(S) using maximum likelihood estimates of qin, qout, and qall:\n\n$F(S) = \\left(\\frac{C_{in}}{B_{in}}\\right)^{C_{in}} \\left(\\frac{C_{out}}{B_{out}}\\right)^{C_{out}} \\left(\\frac{C_{all}}{B_{all}}\\right)^{-C_{all}}$ if $\\frac{C_{in}}{B_{in}} > \\frac{C_{out}}{B_{out}}$, and $F(S) = 1$ otherwise.\n\nIn this expression, we have $C_{in} = \\sum_{S} c_i$, $C_{out} = \\sum_{G-S} c_i$, $C_{all} = \\sum_{G} c_i$, and similarly for the baselines $B_{in} = \\sum_{S} b_i$, $B_{out} = \\sum_{G-S} b_i$, and $B_{all} = \\sum_{G} b_i$.\nOnce we have found the highest scoring region S\u2217 = argmaxS F(S) of grid G, and its score F\u2217 = F(S\u2217), we must still determine the statistical significance of this region by randomization testing. To do so, we randomly create a large number R of replica grids by sampling under the null hypothesis ci \u223c Po(qall bi), and find the highest scoring region and its score for each replica grid. Then the p-value of S\u2217 is $\\frac{R_{beat}+1}{R+1}$, where Rbeat is the number of replicas G\u2032 with F\u2217 higher than the original grid. If this p-value is less than some threshold (e.g. 0.05), we can conclude that the discovered region is unlikely to have occurred by chance, and is thus a significant spatial cluster; otherwise, no significant clusters exist.\n\nThe frequentist scan statistic is a useful tool for cluster detection, and is commonly used in the public health community for detection of disease outbreaks. However, there are three main disadvantages to this approach. First, it is difficult to make use of any prior information that we may have, for example, our prior beliefs about the size of a potential outbreak and its impact on disease rate. Second, the accuracy of this technique is highly dependent on the correctness of our maximum likelihood parameter estimates. 
As a result, the model\nis prone to parameter over\ufb01tting, and may lose detection power in practice because of\nmodel misspeci\ufb01cation. Finally, the frequentist scan statistic is very time consuming, and\nmay be computationally infeasible for large datasets. A naive approach requires searching\nover all rectangular regions, both for the original grid and for each replica grid. Since there\nare O(N4) rectangles to search for an N \u00d7 N grid, the total computation time is O(RN4),\nwhere R = 1000 is a typical number of replications. In past work [5, 2, 6], we have shown\nhow to reduce this computation time by a factor of 20-2000x through use of the \u201cfast spatial\nscan\u201d algorithm; nevertheless, we must still perform this faster search both for the original\ngrid and for each replica.\n\n\fWe propose to remedy these problems through the use of a Bayesian spatial scan statistic.\nFirst, our Bayesian model makes use of prior information about the likelihood, size, and\nimpact of an outbreak. If these priors are chosen well, we should achieve better detec-\ntion power than the frequentist approach. Second, the Bayesian method uses a marginal\nlikelihood approach, averaging over possible values of the model parameters qin, qout, and\nqall, rather than relying on maximum likelihood estimates of these parameters. This makes\nthe model more \ufb02exible and less prone to over\ufb01tting, and reduces the potential impact of\nmodel misspeci\ufb01cation. Finally, under the Bayesian model there is no need for randomiza-\ntion testing, and (since we need only to search the original grid) even a naive search can be\nperformed relatively quickly. 
We now present the Bayesian spatial scan statistic, and then compare it to the frequentist approach on the task of detecting simulated disease epidemics.\n\n2 The Bayesian scan statistic\nHere we consider the natural Bayesian extension of Kulldorff\u2019s scan statistic, moving from a Poisson to a conjugate Gamma-Poisson model. Bayesian Gamma-Poisson models are a common representation for count data in epidemiology, and have been used in disease mapping by Clayton and Kaldor [7], Molli\u00e9 [8], and others. In disease mapping, the effect of the Gamma prior is to produce a spatially smoothed map of disease rates; here we instead focus on computing the posterior probabilities, allowing us to determine the likelihood that an outbreak has occurred, and to estimate the location and size of potential outbreaks.\n\nFor the Bayesian spatial scan, as in the frequentist approach, we wish to compare the null hypothesis H0 of no clusters to the set of alternative hypotheses H1(S), each representing a cluster in some region S. As before, we assume Poisson likelihoods, ci \u223c Po(qbi). The difference is that we assume a hierarchical Bayesian model where the disease rates qin, qout, and qall are themselves drawn from Gamma distributions. Thus, under the null hypothesis H0, we have q = qall for all si \u2208 G, where $q_{all} \\sim \\mathrm{Ga}(\\alpha_{all}, \\beta_{all})$. Under the alternative hypothesis H1(S), we have q = qin for all si \u2208 S and q = qout for all si \u2208 G \u2212 S, where we independently draw $q_{in} \\sim \\mathrm{Ga}(\\alpha_{in}, \\beta_{in})$ and $q_{out} \\sim \\mathrm{Ga}(\\alpha_{out}, \\beta_{out})$. We discuss how the $\\alpha$ and $\\beta$ priors are chosen below. From this model, we can compute the posterior probabilities P(H1(S) | D) of an outbreak in each region S, and the probability P(H0 | D) that no outbreak has occurred, given dataset D:\n\n$P(H_0 \\mid D) = \\frac{P(D \\mid H_0) P(H_0)}{P(D)}$ and $P(H_1(S) \\mid D) = \\frac{P(D \\mid H_1(S)) P(H_1(S))}{P(D)}$,\n\nwhere $P(D) = P(D \\mid H_0) P(H_0) + \\sum_{S} P(D \\mid H_1(S)) P(H_1(S))$. We discuss the choice of prior probabilities P(H0) and P(H1(S)) below. To compute the marginal likelihood of the data given each hypothesis, we must integrate over all possible values of the parameters (qin, qout, qall) weighted by their respective probabilities. Since we have chosen a conjugate prior, we can easily obtain a closed-form solution for these likelihoods:\n\n$P(D \\mid H_0) = \\int P(q_{all} \\sim \\mathrm{Ga}(\\alpha_{all}, \\beta_{all})) \\prod_{s_i \\in G} P(c_i \\sim \\mathrm{Po}(q_{all} b_i)) dq_{all}$\n\n$P(D \\mid H_1(S)) = \\int P(q_{in} \\sim \\mathrm{Ga}(\\alpha_{in}, \\beta_{in})) \\prod_{s_i \\in S} P(c_i \\sim \\mathrm{Po}(q_{in} b_i)) dq_{in} \\times \\int P(q_{out} \\sim \\mathrm{Ga}(\\alpha_{out}, \\beta_{out})) \\prod_{s_i \\in G-S} P(c_i \\sim \\mathrm{Po}(q_{out} b_i)) dq_{out}$\n\nNow, computing the integral, and letting $C = \\sum c_i$ and $B = \\sum b_i$, we obtain:\n\n$\\int P(q \\sim \\mathrm{Ga}(\\alpha, \\beta)) \\prod_{s_i} P(c_i \\sim \\mathrm{Po}(q b_i)) dq = \\int \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} q^{\\alpha-1} e^{-\\beta q} \\prod_{s_i} \\frac{(q b_i)^{c_i} e^{-q b_i}}{(c_i)!} dq \\propto \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\int q^{\\alpha-1} e^{-\\beta q} q^{\\sum c_i} e^{-q \\sum b_i} dq = \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\int q^{\\alpha+C-1} e^{-(\\beta+B) q} dq = \\frac{\\beta^{\\alpha} \\Gamma(\\alpha+C)}{(\\beta+B)^{\\alpha+C} \\Gamma(\\alpha)}$\n\nThus we have the following expressions for the marginal likelihoods:\n\n$P(D \\mid H_0) \\propto \\frac{(\\beta_{all})^{\\alpha_{all}} \\Gamma(\\alpha_{all}+C_{all})}{(\\beta_{all}+B_{all})^{\\alpha_{all}+C_{all}} \\Gamma(\\alpha_{all})}$, and\n\n$P(D \\mid H_1(S)) \\propto \\frac{(\\beta_{in})^{\\alpha_{in}} \\Gamma(\\alpha_{in}+C_{in})}{(\\beta_{in}+B_{in})^{\\alpha_{in}+C_{in}} \\Gamma(\\alpha_{in})} \\times \\frac{(\\beta_{out})^{\\alpha_{out}} \\Gamma(\\alpha_{out}+C_{out})}{(\\beta_{out}+B_{out})^{\\alpha_{out}+C_{out}} \\Gamma(\\alpha_{out})}$.\n\nThe Bayesian spatial scan statistic can be computed simply by first calculating the score P(D | H1(S))P(H1(S)) for each spatial region S, maintaining a list of regions ordered by score. We then calculate P(D | H0)P(H0), and add this to the sum of all region scores, obtaining the probability of the data P(D). Finally, we can compute the posterior probability $P(H_1(S) \\mid D) = \\frac{P(D \\mid H_1(S)) P(H_1(S))}{P(D)}$ for each region, as well as $P(H_0 \\mid D) = \\frac{P(D \\mid H_0) P(H_0)}{P(D)}$. Then we can return all regions with non-negligible posterior probabilities, the posterior probability of each, and the overall probability of an outbreak. Note that no randomization testing is necessary, and thus overall complexity is proportional to number of regions searched, e.g. $O(N^4)$ for searching over axis-aligned rectangles in an N \u00d7 N grid.\n\n2.1 Choosing priors\nOne of the most challenging tasks in any Bayesian analysis is the choice of priors. For any region S that we examine, we must have values of the parameter priors $\\alpha_{in}(S)$, $\\beta_{in}(S)$, $\\alpha_{out}(S)$, and $\\beta_{out}(S)$, as well as the region prior probability P(H1(S)). We must also choose the global parameter priors $\\alpha_{all}$ and $\\beta_{all}$, as well as the \u201cno outbreak\u201d prior P(H0).\nHere we consider the simple case of a uniform region prior, with a known prior probability of an outbreak P1. In other words, if there is an outbreak, it is assumed to be equally likely to occur in any spatial region. Thus we have $P(H_0) = 1 - P_1$, and $P(H_1(S)) = \\frac{P_1}{N_{reg}}$, where Nreg is the total number of regions searched. The parameter P1 can be obtained from historical data, estimated by human experts, or can simply be used to tune the sensitivity and specificity of the algorithm. The model can also be easily adapted to a non-uniform region prior, taking into account our prior beliefs about the size and shape of outbreaks.\n\nFor the parameter priors, we assume that we have access to a large number of days of past data, during which no outbreaks are known to have occurred. We can then obtain estimated values of the parameter priors under the null hypothesis by matching the moments of each Gamma distribution to their historical values. In other words, we set the expectation and variance of the Gamma distribution $\\mathrm{Ga}(\\alpha_{all}, \\beta_{all})$ to the sample expectation and variance of $\\frac{C_{all}}{B_{all}}$ observed in past data: $\\frac{\\alpha_{all}}{\\beta_{all}} = \\mathrm{E}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]$, and $\\frac{\\alpha_{all}}{\\beta_{all}^2} = \\mathrm{Var}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]$. Solving for $\\alpha_{all}$ and $\\beta_{all}$, we obtain\n\n$\\alpha_{all} = \\frac{\\left(\\mathrm{E}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]\\right)^2}{\\mathrm{Var}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]}$ and $\\beta_{all} = \\frac{\\mathrm{E}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]}{\\mathrm{Var}_{sample}\\left[\\frac{C_{all}}{B_{all}}\\right]}$.\n\nThe calculation of priors $\\alpha_{in}(S)$, $\\beta_{in}(S)$, $\\alpha_{out}(S)$, and $\\beta_{out}(S)$ is identical except for two differences: first, we must condition on the region S, and second, we must assume the alternative hypothesis H1(S) rather than the null hypothesis H0. Repeating the above derivation for the \u201cout\u201d parameters, we obtain\n\n$\\alpha_{out}(S) = \\frac{\\left(\\mathrm{E}_{sample}\\left[\\frac{C_{out}(S)}{B_{out}(S)}\\right]\\right)^2}{\\mathrm{Var}_{sample}\\left[\\frac{C_{out}(S)}{B_{out}(S)}\\right]}$ and $\\beta_{out}(S) = \\frac{\\mathrm{E}_{sample}\\left[\\frac{C_{out}(S)}{B_{out}(S)}\\right]}{\\mathrm{Var}_{sample}\\left[\\frac{C_{out}(S)}{B_{out}(S)}\\right]}$,\n\nwhere $C_{out}(S)$ and $B_{out}(S)$ are respectively the total count $\\sum_{G-S} c_i$ and total baseline $\\sum_{G-S} b_i$ outside the region. Note that an outbreak in some region S does not affect the disease rate outside region S. Thus we can use the same values of $\\alpha_{out}(S)$ and $\\beta_{out}(S)$ whether we are assuming the null hypothesis H0 or the alternative hypothesis H1(S).\nOn the other hand, the effect of an outbreak inside region S must be taken into account when computing $\\alpha_{in}(S)$ and $\\beta_{in}(S)$; since we assume that no outbreak has occurred in the past data, we cannot just use the sample mean and variance, but must consider what we expect these quantities to be in the event of an outbreak. We assume that the outbreak will increase qin by a multiplicative factor m, thus multiplying the mean and variance of $\\frac{C_{in}}{B_{in}}$ by m. To account for this in the Gamma distribution $\\mathrm{Ga}(\\alpha_{in}, \\beta_{in})$, we multiply $\\alpha_{in}$ by m while leaving $\\beta_{in}$ unchanged. Thus we have\n\n$\\alpha_{in}(S) = m \\frac{\\left(\\mathrm{E}_{sample}\\left[\\frac{C_{in}(S)}{B_{in}(S)}\\right]\\right)^2}{\\mathrm{Var}_{sample}\\left[\\frac{C_{in}(S)}{B_{in}(S)}\\right]}$ and $\\beta_{in}(S) = \\frac{\\mathrm{E}_{sample}\\left[\\frac{C_{in}(S)}{B_{in}(S)}\\right]}{\\mathrm{Var}_{sample}\\left[\\frac{C_{in}(S)}{B_{in}(S)}\\right]}$,\n\nwhere $C_{in}(S) = \\sum_{S} c_i$ and $B_{in}(S) = \\sum_{S} b_i$. Since we typically do not know the exact value of m, here we use a discretized uniform distribution for m, ranging from m = 1 to m = 3 at intervals of 0.2. Then scores can be calculated by averaging likelihoods over the distribution of m.\n\nFinally, we consider how to deal with the case where the past values of the counts and baselines are not given. In this \u201cblind Bayesian\u201d (BBayes) case, we assume that counts are randomly generated under the null hypothesis ci \u223c Po(q0 bi), where q0 is the expected ratio of count to baseline under the null (for example, q0 = 1 if baselines are obtained by estimating the expected value of the count). Under this simple assumption, we can easily compute the expectation and variance of the ratio of count to baseline under the null hypothesis: $\\mathrm{E}\\left[\\frac{C}{B}\\right] = \\frac{\\mathrm{E}[\\mathrm{Po}(q_0 B)]}{B} = \\frac{q_0 B}{B} = q_0$, and $\\mathrm{Var}\\left[\\frac{C}{B}\\right] = \\frac{\\mathrm{Var}[\\mathrm{Po}(q_0 B)]}{B^2} = \\frac{q_0 B}{B^2} = \\frac{q_0}{B}$. Thus we have $\\alpha = q_0 B$ and $\\beta = B$ under the null hypothesis. This gives us $\\alpha_{all} = q_0 B_{all}$, $\\beta_{all} = B_{all}$, $\\alpha_{out}(S) = q_0 B_{out}(S)$, $\\beta_{out}(S) = B_{out}(S)$, $\\alpha_{in}(S) = m q_0 B_{in}(S)$, and $\\beta_{in}(S) = B_{in}(S)$. We can use a uniform distribution for m as before. In our empirical evaluation below, we consider both the Bayes and BBayes methods of generating parameter priors.\n\n3 Results: detection power\nWe evaluated the Bayesian and frequentist methods on two types of simulated respiratory outbreaks, injected into real Emergency Department and over-the-counter drug sales data for Allegheny County, Pennsylvania. 
All data were aggregated to the zip code level to\nensure anonymity, giving the daily counts of respiratory ED cases and sales of OTC cough\nand cold medication in each of 88 zip codes for one year. The baseline (expected count)\nfor each zip code was estimated using the mean count of the previous 28 days. Zip code\ncentroids were mapped to a 16 \u00d7 16 grid, and all rectangles up to 8 \u00d7 8 were examined. We\n\ufb01rst considered simulated aerosol releases of inhalational anthrax (e.g. from a bioterrorist\nattack), generated by the Bayesian Aerosol Release Detector, or BARD [9]. The BARD\nsimulator uses a Bayesian network model to determine the number of spores inhaled by\nindividuals in affected areas, the resulting number and severity of anthrax cases, and the\nresulting number of respiratory ED cases on each day of the outbreak in each affected zip\ncode. Our second type of outbreak was a simulated \u201cFictional Linear Onset Outbreak\u201d\n(or \u201cFLOO\u201d), as in [10]. A FLOO(D, T ) outbreak is a simple simulated outbreak with\nduration T , which generates tD\ncases in each affected zip code on day t of the outbreak\n(0 < t \u2264 T /2), then generates TD/2 cases per day for the remainder of the outbreak. Thus\nwe have an outbreak where the number of cases ramps up linearly and then levels off.\nWhile this is clearly a less realistic outbreak than the BARD-simulated anthrax attack, it\ndoes have several advantages: most importantly, it allows us to precisely control the slope\nof the outbreak curve and examine how this affects our methods\u2019 detection ability.\n\nTo test detection power, a semi-synthetic testing framework similar to [10] was used: we\n\ufb01rst run our spatial scan statistic for each day of the last nine months of the year (the \ufb01rst\nthree months are used only to estimate baselines and priors), and obtain the score F \u2217 for\neach day. 
Then for each outbreak we wish to test, we inject that outbreak into the data, and obtain the score F\u2217(t) for each day t of the outbreak. By finding the proportion of baseline days with scores higher than F\u2217(t), we can determine the proportion of false positives we would have to accept to detect the outbreak on day t. This allows us to compute, for any given level of false positives, what proportion of outbreaks can be detected, and the mean number of days to detection. We compare three methods of computing the score F\u2217: the frequentist method (F\u2217 is the maximum likelihood ratio F(S) over all regions S), the Bayesian maximum method (F\u2217 is the maximum posterior probability P(H1(S) | D) over all regions S), and the Bayesian total method (F\u2217 is the sum of posterior probabilities P(H1(S) | D) over all regions S, i.e. total posterior probability of an outbreak). For the two Bayesian methods, we consider both Bayes and BBayes methods for calculating priors, thus giving us a total of five methods to compare (frequentist, Bayes max, BBayes max, Bayes tot, BBayes tot).\nIn Table 1, we compare these methods with respect to proportion of outbreaks detected and mean number of days to detect, at a false positive rate of 1/month. Methods were evaluated on seven types of simulated outbreaks: three FLOO outbreaks on ED data, two FLOO outbreaks on OTC data, and two BARD outbreaks (with different amounts of anthrax release) on ED data. For each outbreak type, each method\u2019s performance was averaged over 100 or 250 simulated outbreaks for BARD or FLOO respectively.\n\nTable 1: Days to detect and proportion of outbreaks detected, 1 false positive/month\nmethod | FLOO ED (4,14) | FLOO ED (2,20) | FLOO ED (1,20) | BARD ED (.125) | BARD ED (.016) | FLOO OTC (40,14) | FLOO OTC (25,20)\nfrequentist | 1.859 (100%) | 3.324 (100%) | 6.122 (96%) | 1.733 (100%) | 3.925 (88%) | 3.582 (100%) | 5.393 (100%)\nBayes max | 1.740 (100%) | 2.875 (100%) | 5.043 (100%) | 1.600 (100%) | 3.755 (88%) | 5.455 (63%) | 7.588 (79%)\nBBayes max | 1.683 (100%) | 2.848 (100%) | 4.984 (100%) | 1.600 (100%) | 3.698 (88%) | 5.164 (65%) | 7.035 (77%)\nBayes tot | 1.882 (100%) | 3.195 (100%) | 5.777 (100%) | 1.633 (100%) | 3.811 (88%) | 3.475 (100%) | 5.195 (100%)\nBBayes tot | 1.840 (100%) | 3.180 (100%) | 5.672 (100%) | 1.617 (100%) | 3.792 (88%) | 4.380 (100%) | 6.929 (99%)\n\nIn Table 1, we observe very different results for the ED and OTC datasets. For the five runs on ED data, all four Bayesian methods consistently detected outbreaks faster than the frequentist method. This difference was most evident for the more slowly growing (harder to detect) outbreaks, especially FLOO(1,20). Across all ED outbreaks, the Bayesian methods showed an average improvement of between 0.13 days (Bayes tot) and 0.43 days (BBayes max) as compared to the frequentist approach; \u201cmax\u201d methods performed substantially better than \u201ctot\u201d methods, and \u201cBBayes\u201d methods performed slightly better than \u201cBayes\u201d methods. For the two runs on OTC data, on the other hand, most of the Bayesian methods performed much worse (over 1 day slower) than the frequentist method. The exception was the Bayes tot method, which again outperformed the frequentist method by an average of 0.15 days. 
We believe that the main reason for these differing results is that the\nOTC data is much noisier than the ED data, and exhibits much stronger seasonal trends.\nAs a result, our baseline estimates (using mean of the previous 28 days) are reasonably ac-\ncurate for ED, but for OTC the baseline estimates will lag behind the seasonal trends (and\nthus, underestimate the expected counts for increasing trends and overestimate for decreas-\ning trends). The BBayes methods, which assume E[C/B] = 1 and thus rely heavily on the\naccuracy of baseline estimates, are not reasonable for OTC. On the other hand, the Bayes\nmethods (which instead learn the priors from previous counts and baselines) can adjust for\nconsistent misestimation of baselines and thus more accurately account for these seasonal\ntrends. The \u201cmax\u201d methods perform badly on the OTC data because a large number of\nbaseline days have posterior probabilities close to 1; in this case, the maximum region pos-\nterior varies wildly from day to day, depending on how much of the total probability is\nassigned to a single region, and is not a reliable measure of whether an outbreak has oc-\ncurred. The total posterior probability of an outbreak, on the other hand, will still be higher\nfor outbreak than non-outbreak days, so the \u201ctot\u201d methods can perform well on OTC as\nwell as ED data. Thus, our main result is that the Bayes tot method, which infers baselines\nfrom past counts and uses total posterior probability of an outbreak to decide when to sound\nthe alarm, consistently outperforms the frequentist method for both ED and OTC datasets.\n\n4 Results: computation time\nAs noted above, the Bayesian spatial scan must search over all rectangular regions for the\noriginal grid only, while the frequentist scan (in order to calculate statistical signi\ufb01cance by\nrandomization) must also search over all rectangular regions for a large number (typically\nR = 1000) of replica grids. 
Thus, as long as the search time per region is comparable for the Bayesian and frequentist methods, we expect the Bayesian approach to be approximately 1000x faster. In Table 2, we compare the run times of the Bayes, BBayes, and frequentist methods for searching a single grid and calculating significance (p-values or posterior probabilities for the frequentist and Bayesian methods respectively), as a function of the grid size N. All rectangles up to size N/2 were searched, and for the frequentist method R = 1000 replications were performed.\n\nTable 2: Comparison of run times for varying grid size N\nmethod | N = 16 | N = 32 | N = 64 | N = 128 | N = 256\nBayes (naive) | 0.7 sec | 10.8 sec | 2.8 min | 44 min | 12 hrs\nBBayes (naive) | 0.6 sec | 9.3 sec | 2.4 min | 37 min | 10 hrs\nfrequentist (naive) | 12 min | 2.9 hrs | 49 hrs | \u223c31 days | \u223c500 days\nfrequentist (fast) | 20 sec | 1.8 min | 10.7 min | 77 min | 10 hrs\n\nThe results confirm our intuition: the Bayesian methods are 900-1200x faster than the frequentist approach, for all values of N tested. However, the frequentist approach can be accelerated dramatically using our \u201cfast spatial scan\u201d algorithm [2], a multiresolution search method which can find the highest scoring region of a grid while searching only a small subset of regions. Comparing the fast spatial scan to the Bayesian approach, we see that the fast spatial scan is slower than the Bayesian method for grid sizes up to N = 128, but slightly faster for N = 256. Thus we now have two options for making the spatial scan statistic computationally feasible for large grid sizes: to use the fast spatial scan to speed up the frequentist scan statistic, or to use the Bayesian scan statistics framework (in which case the naive algorithm is typically fast enough). 
For\neven larger grid sizes, it may be possible to extend the fast spatial scan to the Bayesian\napproach: this would give us the best of both worlds, searching only one grid, and using a\nfast algorithm to do so. We are currently investigating this potentially useful synthesis.\n5 Discussion\nWe have presented a Bayesian spatial scan statistic, and demonstrated several ways in\nwhich this method is preferable to the standard (frequentist) scan statistics approach. In\nSection 3, we demonstrated that the Bayesian method, with a relatively non-informative\nprior distribution, consistently outperforms the frequentist method with respect to detec-\ntion power. Since the Bayesian framework allows us to easily incorporate prior informa-\ntion about size, shape, and impact of an outbreak, it is likely that we can achieve even\nbetter detection performance using more informative priors, e.g. obtained from experts in\nthe domain. In Section 4, we demonstrated that the Bayesian spatial scan can be computed\nin much less time than the frequentist method, since randomization testing is unnecessary.\nThis allows us to search large grid sizes using a naive search algorithm, and even larger\ngrids might be searched by extending the fast spatial scan to the Bayesian framework.\n\nWe now consider three other arguments for use of the Bayesian spatial scan. First, the\nBayesian method has easily interpretable results: it outputs the posterior probability that\nan outbreak has occurred, and the distribution of this probability over possible outbreak\nregions. This makes it easy for a user (e.g. public health of\ufb01cial) to decide whether to\ninvestigate each potential outbreak based on the costs of false positives and false negatives;\nthis type of decision analysis cannot be done easily in the frequentist framework. 
Another\nuseful result of the Bayesian method is that we can compute a \u201cmap\u201d of the posterior proba-\nbilities of an outbreak in each grid cell, by summing the posterior probabilities P(H1(S) |D)\nof all regions containing that cell. This technique allows us to deal with the case where the\nposterior probability mass is spread among many regions, by observing cells which are\ncommon to most or all of these regions. We give an example of such a map below:\n\nFigure 1: Output of Bayesian spatial scan on baseline OTC data, 1/30/05.\nCell shading is based on posterior probability of an outbreak in that cell,\nranging from white (0%) to black (100%). The bold rectangle represents\nthe most likely region (posterior probability 12.27%) and the darkest cell\nis the most likely cell (total posterior probability 86.57%). Total posterior\nprobability of an outbreak is 86.61%.\n\n\fSecond, calibration of the Bayesian statistic is easier than calibration of the frequentist\nstatistic. As noted above, it is simple to adjust the sensitivity and speci\ufb01city of the Bayesian\nmethod by setting the prior probability of an outbreak P1, and then we can \u201csound the\nalarm\u201d whenever posterior probability of an outbreak exceeds some threshold. In the fre-\nquentist method, on the other hand, many regions in the baseline data have suf\ufb01ciently\nhigh likelihood ratios that no replicas beat the original grid; thus we cannot distinguish the\np-values of outbreak and non-outbreak days. 
While one alternative is to \u201csound the alarm\u201d\nwhen the likelihood ratio is above some threshold (rather than when p-value is below some\nthreshold), this is technically incorrect: because the baselines for each day of data are dif-\nferent, the distribution of region scores under the null hypothesis will also differ from day\nto day, and thus days with higher likelihood ratios do not necessarily have lower p-values.\nThird, we argue that it is easier to combine evidence from multiple detectors within the\nBayesian framework, i.e. by modeling the joint probability distribution. We are in the pro-\ncess of examining Bayesian detectors which look simultaneously at the day\u2019s Emergency\nDepartment records and over-the-counter drug sales in order to detect emerging clusters,\nand we believe that combination of detectors is an important area for future research.\n\nIn conclusion, we note that, though both Bayesian modeling [7-8] and (frequentist) spa-\ntial scanning [3-4] are common in the spatial statistics literature, this is (to the best of our\nknowledge) the \ufb01rst model which combines the two techniques into a single framework.\nIn fact, very little work exists on Bayesian methods for spatial cluster detection. One no-\ntable exception is the literature on spatial cluster modeling [11-12], which attempts to infer\nthe location of cluster centers by inferring parameters of a Bayesian process model. Our\nwork differs from these methods both in its computational tractability (their models typi-\ncally have no closed form solution, so computationally expensive MCMC approximations\nare used) and its easy interpretability (their models give no indication as to statistical sig-\nni\ufb01cance or posterior probability of clusters found). Thus we believe that this is the \ufb01rst\nBayesian spatial cluster detection method which is powerful and useful, yet computation-\nally tractable. 
We are currently running the Bayesian and frequentist scan statistics on\ndaily OTC sales data from over 10,000 stores, searching for emerging disease outbreaks on\na daily basis nationwide. Additionally, we are working to extend the Bayesian statistic to\nfMRI data, with the goal of discovering regions of brain activity corresponding to given\ncognitive tasks [13, 6]. We believe that the Bayesian approach has the potential to improve\nboth speed and detection power of the spatial scan in this domain as well.\n\nReferences\n[1] M. Kulldorff. 1999. Spatial scan statistics: models, calculations, and applications. In J. Glaz and M. Balakrishnan, eds., Scan\nStatistics and Applications, Birkhauser, 303-322.\n[2] D. B. Neill and A. W. Moore. 2004. Rapid detection of significant spatial clusters. In Proc. 10th ACM SIGKDD Intl. Conf.\non Knowledge Discovery and Data Mining, 256-265.\n[3] M. Kulldorff and N. Nagarwalla. 1995. Spatial disease clusters: detection and inference. Statistics in Medicine 14, 799-810.\n[4] M. Kulldorff. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6), 1481-1496.\n[5] D. B. Neill and A. W. Moore. 2004. A fast multi-resolution method for detection of significant spatial disease clusters. In\nAdvances in Neural Information Processing Systems 16, 651-658.\n[6] D. B. Neill, A. W. Moore, F. Pereira, and T. Mitchell. 2005. Detecting significant multidimensional spatial clusters. In\nAdvances in Neural Information Processing Systems 17, 969-976.\n[7] D. G. Clayton and J. Kaldor. 1987. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping.\nBiometrics 43, 671-681.\n[8] A. Molli\u00e9. 1999. Bayesian and empirical Bayes approaches to disease mapping. In A. B. Lawson, et al., eds. Disease Mapping\nand Risk Assessment for Public Health. Wiley, Chichester.\n[9] W. Hogan, G. Cooper, M. Wagner, and G. Wallstrom. 2004. A Bayesian anthrax aerosol release detector. 
Technical Report,\nRODS Laboratory, University of Pittsburgh.\n[10] D. B. Neill, A. W. Moore, M. Sabhnani, and K. Daniel. 2005. Detection of emerging space-time clusters. In Proc. 11th ACM\nSIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.\n[11] R. E. Gangnon and M. K. Clayton. 2000. Bayesian detection and modeling of spatial disease clustering. Biometrics 56,\n922-935.\n[12] A. B. Lawson and D. G. T. Denison, eds. 2002. Spatial Cluster Modelling. Chapman & Hall/CRC, Boca Raton, FL.\n[13] X. Wang, R. Hutchinson, and T. Mitchell. 2004. Training fMRI classifiers to detect cognitive states across multiple human\nsubjects. In Advances in Neural Information Processing Systems 16, 709-716.\n\n", "award": [], "sourceid": 2819, "authors": [{"given_name": "Daniel", "family_name": "Neill", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Gregory", "family_name": "Cooper", "institution": null}]}