{"title": "Detecting Significant Multidimensional Spatial Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": null, "full_text": " Detecting Significant Multidimensional Spatial\n Clusters\n\n\n\n Daniel B. Neill, Andrew W. Moore, Francisco Pereira, and Tom Mitchell\n School of Computer Science\n Carnegie Mellon University\n Pittsburgh, PA 15213\n {neill,awm,fpereira,t.mitchell}@cs.cmu.edu\n\n Abstract\n\n Assume a uniform, multidimensional grid of bivariate data, where each\n cell of the grid has a count ci and a baseline bi. Our goal is to find\n spatial regions (d-dimensional rectangles) where the ci are significantly\n higher than expected given bi. We focus on two applications: detection of\n clusters of disease cases from epidemiological data (emergency depart-\n ment visits, over-the-counter drug sales), and discovery of regions of in-\n creased brain activity corresponding to given cognitive tasks (from fMRI\n data). Each of these problems can be solved using a spatial scan statistic\n (Kulldorff, 1997), where we compute the maximum of a likelihood ratio\n statistic over all spatial regions, and find the significance of this region\n by randomization. However, computing the scan statistic for all spatial\n regions is generally computationally infeasible, so we introduce a novel\n fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill\n and Moore, 2004) to arbitrary dimensions. Our new multidimensional\n multiresolution algorithm allows us to find spatial clusters up to 1400x\n faster than the naive spatial scan, without any loss of accuracy.\n\n\n1 Introduction\n\nOne of the core goals of modern statistical inference and data mining is to discover patterns\nand relationships in data. In many applications, however, it is important not only to discover\npatterns, but to distinguish those patterns that are significant from those that are likely to\nhave occurred by chance. 
This is particularly important in epidemiological applications,\nwhere a rise in the number of disease cases in a region may or may not be indicative\nof an emerging epidemic. In order to decide whether further investigation is necessary,\nepidemiologists must know not only the location of a possible outbreak, but also some\nmeasure of the likelihood that an outbreak is occurring in that region. Similarly, when\ninvestigating brain imaging data, we want to not only find regions of increased activity, but\ndetermine whether these increases are significant or due to chance fluctuations.\n\nMore generally, we are interested in spatial data mining problems where the goal is detection\nof overdensities: spatial regions with high counts relative to some underlying baseline.\nIn the epidemiological datasets, the count is some quantity (e.g. number of disease cases,\nor units of cough medication sold) in a given area, where the baseline is the expected value\nof that quantity based on historical data. In the brain imaging datasets, our count is the\ntotal fMRI activation in a given set of voxels under the experimental condition, while our\nbaseline is the total activation in that set of voxels under the null or control condition.\n\nWe consider the case in which data has been aggregated to a uniform, d-dimensional grid.\nFor the fMRI data, we have three spatial dimensions; for the epidemiological data, we have\ntwo spatial dimensions but also use several other quantities (time, patients' age and gender)\nas \"pseudo-spatial\" dimensions; this is discussed in more detail below.\n\nIn the general case, let G be a d-dimensional grid of cells, with size N1 × N2 × ... × Nd.\nEach cell si ∈ G (where i is a d-dimensional vector) is associated with a count ci and a\nbaseline bi. Our goal is to search over all d-dimensional rectangular regions S ⊆ G, and\nfind regions where the total count C(S) = Σ_{si ∈ S} ci is higher than expected, given the baseline\nB(S) = Σ_{si ∈ S} bi. 
In addition to discovering these high-density regions, we must also perform\nstatistical testing to determine whether these regions are significant. As is necessary in\nthe scan statistics framework, we focus on finding the single, most significant region; the\nmethod can be iterated (removing each significant cluster once it is found) to find multiple\nsignificant regions.\n\n1.1 Likelihood ratio statistics\nOur basic model assumes that counts ci are generated by an inhomogeneous Poisson process\nwith mean q bi, where q (the underlying ratio of count to baseline) may vary spatially.\nWe wish to detect hyper-rectangular regions S such that q is significantly higher inside S\nthan outside S. To do so, for a given region S, we assume that q = qin uniformly for cells\nsi ∈ S, and q = qout uniformly for cells si ∈ G - S. We then test the null hypothesis H0(S):\nqin ≤ (1+ε)qout against the alternative hypothesis H1(S): qin > (1+ε)qout. If ε = 0, this is\nequivalent to the classical spatial scan statistic [1-2]: we are testing for regions where qin is\ngreater than qout. However, in many real-world applications (including the epidemiological\nand fMRI datasets discussed later) we expect some fluctuation in the underlying baseline;\nthus, we do not want to detect all deviations from baseline, but only those where the amount\nof deviation is greater than some threshold. For example, a 10% increase in disease cases\nin some region may not be interesting to epidemiologists, even if the underlying population\nis large enough to conclude that this is a \"real\" (statistically significant) increase in q. By\nincreasing ε, we can focus the scan statistic on regions with larger ratios of count to baseline.\nFor example, we can use the scan statistic with ε = 0.25 to test for regions where qin\nis more than 25% higher than qout. Following Kulldorff [1], our spatial scan statistic is the\nmaximum, over all regions S, of the ratio of the likelihoods under the alternative and null\nhypotheses. 
Taking logs for convenience, we have:\n\n D_ε(S) = log { sup_{qin > (1+ε)qout} ∏_{si ∈ S} P(ci ~ Po(qin bi)) ∏_{si ∈ G-S} P(ci ~ Po(qout bi)) }\n - log { sup_{qin ≤ (1+ε)qout} ∏_{si ∈ S} P(ci ~ Po(qin bi)) ∏_{si ∈ G-S} P(ci ~ Po(qout bi)) }\n\n = (sgn) × { C(S) log [ C(S) / ((1+ε) B(S)) ]\n + (Ctot - C(S)) log [ (Ctot - C(S)) / (Btot - B(S)) ]\n - Ctot log [ Ctot / (Btot + ε B(S)) ] }\n\nwhere C(S) and B(S) are the count and baseline of the region S under consideration, Ctot\nand Btot are the total count and baseline of the entire grid G, and sgn = +1 if\nC(S) / B(S) > (1+ε) (Ctot - C(S)) / (Btot - B(S)) and -1 otherwise. Then the scan statistic\nD_ε,max is equal to the maximum D_ε(S)\nover all spatial regions (d-dimensional rectangles) under consideration. We note that our\nstatistical and computational methods are not limited to the Poisson model given here; any\nmodel of null and alternative hypotheses such that the resulting statistic D(S) satisfies the\nconditions given in [4] can be used for the fast spatial scan.\n\n1.2 Randomization testing\nOnce we have found the highest scoring region S* = arg max_S D(S) of grid G, we must still\ndetermine the statistical significance of this region. Since the exact distribution of the test\nstatistic Dmax is only known in special cases, in general we must find the region's p-value by\nrandomization. To do so, we run a large number R of random replications, where a replica\nhas the same underlying baselines bi as G, but counts are randomly drawn from the null\nhypothesis H0(S*). More precisely, we pick ci ~ Po(q bi), where q = qin = (1+ε) Ctot / (Btot + ε B(S*))\nfor si ∈ S*, and q = qout = Ctot / (Btot + ε B(S*)) for si ∈ G - S*. The number of replicas G' with\nDmax(G') ≥ Dmax(G), divided by the total number of replications R, gives us the p-value\nfor our most significant region S*. 
If this p-value is less than α (where α is the false positive\nrate, typically chosen to be 0.05 or 0.1), we can conclude that the discovered region is\nstatistically significant at level α.\n\n1.3 The naive spatial scan\nThe simplest method of finding Dmax is to compute D(S) for all rectangular regions of sizes\nk1 × k2 × ... × kd, where 1 ≤ kj ≤ Nj. Since there are a total of ∏_{j=1..d} (Nj - kj + 1) regions\nof each size, there are a total of O(∏_{j=1..d} Nj^2) regions to examine. We can compute D(S)\nfor any region S in constant time, by first finding the count C(S) and baseline B(S), then\ncomputing D.¹ This allows us to compute Dmax of a grid G in O(∏_{j=1..d} Nj^2) time. However,\nsignificance testing by randomization also requires us to find Dmax for each replica G',\nand compare this to Dmax(G); thus the total complexity is multiplied by the number of\nreplications R. When the size of the grid is large, as is the case for the epidemiological and\nfMRI datasets we are considering, this naive approach is computationally infeasible.\n\nInstead, we apply our \"overlap-multiresolution partitioning\" algorithm [3-4], generalizing\nthis method from two-dimensional to d-dimensional datasets. This reduces the complexity\nto O(∏_{j=1..d} (Nj log Nj)) in cases where the most significant region S* has a sufficiently high\nratio of count to baseline, and (as we show in Section 3) typically results in tens to thousands\nof times speedup over the naive approach. We note that this fast spatial scan algorithm is\nexact (always finds the correct value of Dmax and the corresponding region S*); the speedup\nresults from the observation that we do not need to search a given set of regions if we can\nprove that none of them have score > Dmax. 
Thus we use a top-down, branch-and-bound\napproach: we maintain the current maximum score of the regions we have searched so far,\ncalculate upper bounds on the scores of subregions contained in a given region, and prune\nregions whose upper bounds are less than the current value of Dmax. When searching a\nreplica grid, we care only whether Dmax of the replica grid is greater than Dmax(G). Thus\nwe can use Dmax of the original grid for pruning on the replicas, and can stop searching a\nreplica if we find a region with score > Dmax(G).\n\n2 Overlap-multiresolution partitioning\n\nAs in [4], we use a multiresolution search method which relies on an overlap-kd tree data\nstructure. The overlap-kd tree, like kd-trees [5] and quadtrees [6], is a hierarchical,\nspace-partitioning data structure. The root node of the tree represents the entire space under\nconsideration (i.e. the entire grid G), and each other node represents a subregion of the\ngrid. Each non-leaf node of a d-dimensional overlap-kd tree has 2d children, an \"upper\"\nand a \"lower\" child in each dimension. For example, in three dimensions, a node has six\nchildren: upper and lower children in the x, y, and z dimensions. The overlap-kd tree is\ndifferent from the standard kd-tree and quadtree in that adjacent regions overlap: rather\nthan splitting the region in half along each dimension, instead each child contains more\nthan half the area of the parent region. For example, a 64 × 64 × 64 grid will have six\nchildren: two of size 48 × 64 × 64, two of size 64 × 48 × 64, and two of size 64 × 64 × 48.\n ¹An old trick makes it possible to compute the count and baseline of any rectangular region in\ntime constant in N: we first form a d-dimensional array of the cumulative counts, then compute\neach region's count by adding/subtracting at most 2^d cumulative counts. 
Note that because of the\nexponential dependence on d, these techniques suffer from the \"curse of dimensionality\": neither the\nnaive spatial scan, nor the fast spatial scan discussed below, is appropriate for very high dimensional\ndatasets.\n\nIn general, let region S have size k1 × k2 × ... × kd. Then the two children of S in dimension\nj (for j = 1 . . . d) have size k1 × ... × kj-1 × fj kj × kj+1 × ... × kd, where 1/2 < fj < 1. This\npartitioning (for the two-dimensional case, where f1 = f2 = 3/4) is illustrated in Figure 1.\nNote that there is a region SC common to all of these children; we call this region the center\nof S. When we partition region S in this manner, it can be proved that any subregion of S\neither a) is contained entirely in (at least) one of S1 . . . S2d, or b) contains the center region\nSC. Figure 1 illustrates each of these possibilities, for the simple case of d = 2.\n\n Figure 1: Overlap-multires partitioning of region S (for d = 2). Any subregion\n of S either a) is contained in some Si, i = 1 . . . 4, or b) contains SC.\n\nNow we can search all subregions of S by recursively searching S1 . . . S2d, then searching\nall of the regions contained in S which contain the center SC. There may be a large number\nof such \"outer regions,\" but since we know that each such region contains the center, we\ncan place very tight bounds on the score of these regions, often allowing us to prune most\nor all of them. Thus the basic outline of our search procedure (ignoring pruning, for the\nmoment) is:\noverlap-search(S)\n{\n call base-case-search(S)\n define child regions S_1..S_2d, center S_C as above\n call overlap-search(S_i) for i=1..2d\n for all S' such that S' is contained in S and contains S_C, call base-case-search(S')\n}\n\nThe fractions fi are selected based on the current sizes ki of the region being searched:\nif ki = 2^m, then fi = 3/4, and if ki = 3 × 2^m, then fi = 2/3. For simplicity, we assume that\nall Ni are powers of two, and thus all region sizes ki will fall into one of these two cases.\nRepeating this partitioning recursively, we obtain the overlap-kd tree structure. For d = 2,\nthe first two levels of the overlap-kd tree are shown in Figure 2.\n\n Figure 2: The first two levels of the two-dimensional overlap-kd tree. Each node\n represents a gridded region (denoted by a thick rectangle) of the entire dataset\n (thin square and dots).\n\nThe overlap-kd tree has several useful properties, which we present here without proof.\nFirst, for every rectangular region S ⊆ G, either S is a gridded region (contained in the\noverlap-kd tree), or there exists a unique gridded region S' such that S is an outer region\nof S' (i.e. S is contained in S', and contains the center region of S'). This means that, if\noverlap-search is called exactly once for each gridded region², and no pruning is done, then\nbase-case-search will be called exactly once for every rectangular region S ⊆ G. In practice,\nwe will prune many regions, so base-case-search will be called at most once for every\nrectangular region, and every region will be either searched or pruned. The second nice\nproperty of our overlap-kd tree is that the total number of gridded regions is O(∏_{j=1..d} (Nj log Nj)).\nThis implies that, if we are able to prune (almost) all outer regions, we can find Dmax of the\ngrid in O(∏_{j=1..d} (Nj log Nj)) time rather than O(∏_{j=1..d} Nj^2). In fact, we may not even need to\nsearch all gridded regions, so in many cases the search will be even faster.\n ²As in [4], we use \"lazy expansion\" to ensure that gridded regions are not multiply searched.\n\n2.1 Score bounds and pruning\nWe now consider which regions can be pruned (discarded without searching) during our\nmultiresolution search procedure. First, given some region S, we must calculate an upper\nbound on the scores D(S') for regions S' ⊆ S. 
More precisely, we are interested in two\nupper bounds: a bound on the score of all subregions S' ⊆ S, and a bound on the score of\nthe outer subregions of S (those regions contained in S and containing its center SC). If the\nfirst bound is less than or equal to Dmax, we can prune region S completely; we do not need\nto search any (gridded or outer) subregion of S. If only the second bound is less than or\nequal to Dmax, we do not need to search the outer subregions of S, but we must recursively\ncall overlap-search on the gridded children of S. If both bounds are greater than Dmax, we\nmust both recursively call overlap-search and search the outer regions.\n\nScore bounds are calculated based on various pieces of information about the subregions\nof S, including: upper and lower bounds bmax, bmin on the baseline of subregions S'; an\nupper bound dmax on the ratio C/B of S'; an upper bound dinc on the ratio C/B of S' - SC; and\na lower bound dmin on the ratio C/B of S - S'. We also know the count C and baseline B of\nregion S, and the count ccenter and baseline bcenter of region SC. Let cin and bin be the count\nand baseline of S'. To find an upper bound on D(S'), we must calculate the values of cin\nand bin which maximize D subject to the given constraints:\n(cin - ccenter) / (bin - bcenter) ≤ dinc, cin / bin ≤ dmax,\n(C - cin) / (B - bin) ≥ dmin, and bmin ≤ bin ≤ bmax. The solution to this maximization problem is derived\nin [4], and (since scores are based only on count and baseline rather than the size and shape\nof the region) it applies directly to the multidimensional case. 
The bounds on baselines and\nratios C/B are first calculated using global values (as a fast, \"first-pass\" pruning technique).\nFor the remaining, unpruned regions, we calculate tighter bounds using the quartering\nmethod of [4], and use these to prune more regions.\n\n2.2 Related work\nOur work builds most directly on the results of Kulldorff [1], who presents the two-dimensional\nspatial scan framework and the classical (ε = 0) likelihood ratio statistic. It\nalso extends [4], in which we present the two-dimensional fast spatial scan. Our major\nextensions in the present work are twofold: the d-dimensional fast spatial scan, and the\ngeneralized likelihood ratio statistics D_ε. A variety of other cluster detection techniques\nexist in the literature on epidemiology [1-3, 7-8], brain imaging [9-11], and machine learning\n[12-15]. The machine learning literature focuses on heuristic or approximate cluster-finding\ntechniques, which typically cannot deal with spatially varying baselines, and more\nimportantly, give no information about the statistical significance of the clusters found.\nOur technique is exact (in that it calculates the maximum of the likelihood ratio statistic\nover all hyper-rectangular spatial regions), and uses a powerful statistical test to determine\nsignificance. Nevertheless, other methods in the literature have some advantages over the\npresent approach, such as applicability to high-dimensional data and fewer assumptions\non the underlying model. The fMRI literature generally tests significance on a per-voxel\nbasis (after applying some method of spatial smoothing); clusters must then be inferred\nby grouping individually significant voxels, and (with the exception of [10]) no per-cluster\nfalse positive rate is guaranteed. The epidemiological literature focuses on detecting significant\ncircular, two-dimensional clusters, and thus cannot deal with multidimensional data\nor elongated regions. 
Detection of elongated regions is extremely important in both epidemiology\n(because of the need to detect windborne or waterborne pathogens) and brain\nimaging (because of the \"folded sheet\" structure of the brain); the present work, as well as\n[4], allows detection of such clusters.\n\n3 Results\n\nWe now describe results of our fast spatial scan algorithm on three sets of real-world data:\ntwo sets of epidemiological data (from emergency department visits and over-the-counter\ndrug sales), and one set of fMRI data. Before presenting these results, we wish to emphasize\nthree main points. First, the extension of scan statistics from two-dimensional to\nd-dimensional datasets dramatically increases the scope of problems for which these techniques\ncan be used. In addition to datasets with more than two spatial dimensions (for\nexample, the fMRI data, which consists of a 3D picture of the brain), we can also examine\ndata with a temporal component (as in the OTC dataset), or where we wish to take demographic\ninformation into account (as in the ED dataset). Second, in all of these datasets, the\nuse of the broader class of likelihood ratio statistics D_ε (instead of only the classical scan\nstatistic, ε = 0) allows us to focus our search on smaller, denser regions rather than slight\n(but statistically significant) increases over a large area. Third, as our results here will\ndemonstrate, the fast spatial scan gains huge performance improvements over the naive\napproach, making the use of the scan statistic feasible in these large, real-world datasets.\n\nOur first test set was a database of (anonymized) Emergency Department data collected\nfrom Western Pennsylvania hospitals in the period 1999-2002. This dataset contains a total\nof 630,000 records, each representing a single ED visit and giving the latitude and longitude\nof the patient's home location to the nearest 1/3 mile (a sufficiently low resolution to\nensure anonymity). 
Additionally, a record contains information about the patient's gender\nand age decile. Thus we map records into a four-dimensional grid, consisting of two spatial\ndimensions (longitude, latitude) and two \"pseudo-spatial\" dimensions (patient gender\nand age decile). This has several advantages over the traditional (two-dimensional) spatial\nscan. First, our test has higher power to detect syndromes which affect differing patient\ndemographics to different extents. For example, if a disease primarily strikes male infants,\nwe might find a cluster with gender = male and age decile = 0 in some spatial region, and\nthis cluster may not be detectable from the combined data. Second, our method accounts\ncorrectly for multiple hypothesis testing. If we were to instead perform a separate test at\nlevel α on each combination of gender and age decile, the overall false positive rate would\nbe much higher than α. We mapped the ED dataset to a 128 × 128 × 2 × 8 grid, with the\nfirst two coordinates corresponding to longitude and latitude, the third coordinate corresponding\nto the patient's gender, and the fourth coordinate corresponding to the patient's\nage decile. We tested for spatial clustering of \"recent\" disease cases: the count of a cell was\nthe number of ED visits in that spatial region, for patients of that age and gender, in 2002,\nand the baseline was the total number of ED visits in that spatial region, for patients of that\nage and gender, over the entire temporal period 1999-2002. We used the D_ε scan statistic\nwith values of ε ranging from 0 to 1.0. For the classical scan statistic (ε = 0), we found a\nregion of size 35 × 34 × 2 × 8; thus the most significant region was spatially localized but\ncut across all genders and age groups. The region had C = 3570 and B = 6409 (C/B ≈ 0.56), as compared\nto C/B = 0.05 outside the region, and thus this is clearly an overdensity. This was confirmed\nby the algorithm, which found the region statistically significant (p-value 0/100). 
With\nthe three other values of ε, the algorithm found almost the same region (35 × 33 × 2 × 8,\nC = 3566, B = 6390) and again found it statistically significant (p-value 0/100). For all\nvalues of ε, the fast scan statistic found the most significant region hundreds of times faster\nthan the naive spatial scan (see Table 1): while the naive approach required approximately\n12 hours per replication, the fast scan searched each replica in approximately 2 minutes,\nplus 100 minutes to search the original grid. Thus the fast algorithm achieved speedups of\n235-325x over the naive approach for the entire run (i.e. searching the original grid and\n100 replicas) on the ED dataset.\n\nOur second test set was a nationwide database of retail sales of over-the-counter cough\nand cold medication. Sales figures were reported by zip code; the data covered 5000 zip\ncodes across the U.S. In this case, our goal was to see if the spatial distribution of sales in\na given week (February 7-14, 2004) was significantly different than the spatial distribution\nof sales during the previous week, and to identify a significant cluster of increased sales if\none exists. Since we wanted to detect clusters even if they were only present for part of the\nweek, we used the date (Feb. 7-14) as a third dimension. 
\n\n Table 1: Performance of algorithm, real-world datasets\n\n test                  ε     sec/orig  sec/rep  speedup  regions (orig)  regions (rep)\n ED                    0     6140      126      x235     358M            622K\n (128 × 128 × 2 × 8)   0.25  6035      100      x275     352M            339K\n (7.35B regions)       0.5   5994      102      x272     348M            362K\n                       1.0   5607      79.6     x325     334M            336K\n OTC                   0     4453      195      x48      302M            2.46M\n (128 × 128 × 8)       0.25  429       123      x90      12.2M           1.39M\n (2.45B regions)       0.5   334       51       x210     8.65M           350K\n                       1.0   229       5.9      x1400    4.40M           < 10\n fMRI                  0     880       384      x7       39.9M           14.0M\n (64 × 64 × 16)        0.01  597       285      x9       35.2M           10.4M\n (588M regions)        0.02  558       188      x14      33.1M           6.65M\n                       0.03  547       97.3     x27      32.3M           3.93M\n                       0.04  538       30.0     x77      31.9M           1.44M\n                       0.05  538       13.1     x148     31.7M           310K\n\nThis is similar to the retrospective space-time scan statistic of [16], which also uses time as a third dimension. However,\nthat algorithm searches over cylinders rather than hyper-rectangles, and thus cannot detect\nspatially elongated clusters. The count of a cell was taken to be the number of sales in that\nspatial region on that day; to adjust for day-of-week effects, the baseline of a cell was taken\nto be the number of sales in that spatial region on the day one week prior (Jan. 31-Feb. 7).\nThus we created a 128 × 128 × 8 grid, where the first two coordinates were derived from\nthe longitude and latitude of that zip code, and the third coordinate was temporal, based on\nthe date. For this dataset, the classical scan statistic (ε = 0) found a region of size 123 × 76\nfrom February 7-11. Unfortunately, since the ratio C/B was only 0.99 inside the region\n(as compared to 0.96 outside) this region would not be interesting to an epidemiologist.\nNevertheless, the region was found to be significant (p-value 0/100) because of the large\ntotal baseline. Thus, in this case, the classical scan statistic finds a large region of very slight\noverdensity rather than a smaller, denser region, and thus is not as useful for detecting\nepidemics. 
For ε = 0.25 and ε = 0.5, the scan statistic found a much more interesting\nregion: a 4 × 1 region on February 9 where C = 882 and B = 240. In this region, the\nnumber of sales of cough medication was 3.7x its expected value; the region's p-value was\ncomputed to be 0/100, so this is a significant overdensity. For ε = 1, the region found was\nalmost the same, consisting of three of these four cells, with C = 825 and B = 190. Again\nthe region was found to be significant (p-value 0/100). For this dataset, the naive approach\ntook approximately three hours per replication. The fast scan statistic took between six\nseconds and four minutes per replication, plus ten minutes to search the original grid, thus\nobtaining speedups of 48-1400x on the OTC dataset.\n\nOur third and final test set was a set of fMRI data, consisting of two \"snapshots\" of a\nsubject's brain under null and experimental conditions respectively. The experimental condition\nwas from a test [9] where the subject is given words, one at a time; he must read these\nwords and identify them as verbs or nouns. The null condition is the subject's average brain\nactivity while fixating on a cursor, before any words are presented. Each snapshot consists\nof a 64 × 64 × 16 grid of voxels, with a reading of fMRI activation for the subset of the\nvoxels where brain activity is occurring. In this case, the count of a cell is the fMRI activation\nfor that voxel under the experimental condition, and the baseline is the activation for\nthat voxel under the null condition. For voxels with no brain activity, we have ci = bi = 0.\nFor the fMRI dataset, the amount of change between activated and non-activated regions is\nsmall, and thus we used values of ε ranging from 0 to 0.05.\n\nFor the classical scan statistic (ε = 0) our algorithm found a 23 × 20 × 11 region, and again\nfound this region significant (p-value 0/100). 
However, this is another example where the\nclassical scan statistic finds a region which is large (1/4 of the entire brain) and only slightly\nincreased in count: C/B = 1.007 inside the region and C/B = 1.002 outside the region. For\nε = 0.01, we find a more interesting cluster: a 5 × 10 × 1 region in the visual cortex\ncontaining four non-zero voxels.³ For this region C/B = 1.052, a large increase, and the region\nis significant at α = 0.1 (p-value 10/100) though not at α = 0.05. For ε = 0.02, we find\nthe same region, but conclude that it is not significant (p-value 32/100). For ε = 0.03 and\nε = 0.04, we find a 3 × 2 × 1 region with C/B = 1.065, but this region is not significant\n(p-values 61/100 and 89/100 respectively). Similarly, for ε = 0.05, we find a single voxel with\nC/B = 1.075, but again it is not significant (p-value 91/100). For this dataset, the naive approach\ntook approximately 45 minutes per replication. The fast scan statistic took between\n13 seconds and six minutes per replication, thus obtaining speedups of 7-148x on the fMRI\ndataset.\n\nThus we have demonstrated (through tests on a variety of real-world datasets) that the\nfast multidimensional spatial scan statistic has significant performance advantages over the\nnaive approach, resulting in speedups up to 1400x without any loss of accuracy. This makes\nit feasible to apply scan statistics in a variety of application domains, including the spatial\nand spatio-temporal detection of disease epidemics (taking demographic information into\naccount), as well as the detection of regions of increased brain activity in fMRI data. We\nare currently examining each of these application domains in more detail, and investigating\nwhich statistics are most useful for each domain. 
The generalized likelihood ratio statistics\npresented here are a first step toward this: by adjusting the parameter ε, we can \"tune\" the\nstatistic to detect smaller and denser, or larger but less dense, regions as desired, and our\nstatistical significance test is adjusted accordingly. We believe that the combination of fast\ncomputational algorithms and more powerful statistical tests presented here will enable the\nmultidimensional spatial scan statistic to be useful in these and many other applications.\n\nReferences\n[1] M. Kulldorff. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6), 1481-1496.\n\n[2] M. Kulldorff. 1999. Spatial scan statistics: models, calculations, and applications. In Glaz and Balakrishnan, eds. Scan\nStatistics and Applications. Birkhauser: Boston, 303-322.\n\n[3] D. B. Neill and A. W. Moore. 2003. A fast multi-resolution method for detection of significant spatial disease clusters. In\nAdvances in Neural Information Processing Systems 16.\n\n[4] D. B. Neill and A. W. Moore. 2004. Rapid detection of significant spatial clusters. To be published in Proc. 10th ACM\nSIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.\n\n[5] J. L. Bentley. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509-517.\n\n[6] R. A. Finkel and J. L. Bentley. 1974. Quadtrees: a data structure for retrieval on composite keys. Acta Informatica 4, 1-9.\n\n[7] S. Openshaw, et al. 1988. Investigation of leukemia clusters by use of a geographical analysis machine. Lancet 1, 272-273.\n\n[8] L. A. Waller, et al. 1994. Spatial analysis to detect disease clusters. In N. Lange, ed. Case Studies in Biometry. Wiley, 3-23.\n\n[9] T. Mitchell et al. 2003. Learning to detect cognitive states from brain images. Machine Learning, in press.\n\n[10] M. Perone Pacifico et al. 2003. False discovery rates for random fields. Carnegie Mellon University Dept. of Statistics,\nTechnical Report 771.\n\n[11] K. 
Worsley et al. 2003. Detecting activation in fMRI data. Stat. Meth. in Medical Research 12, 401-418.\n\n[12] R. Agrawal, et al. 1998. Automatic subspace clustering of high dimensional data for data mining applications. Proc.\nACM-SIGMOD Intl. Conference on Management of Data, 94-105.\n\n[13] J. H. Friedman and N. I. Fisher. 1999. Bump hunting in high dimensional data. Statistics and Computing 9, 123-143.\n\n[14] S. Goil, et al. 1999. MAFIA: efficient and scalable subspace clustering for very large data sets. Northwestern University,\nTechnical Report CPDC-TR-9906-010.\n\n[15] W. Wang, et al. 1997. STING: a statistical information grid approach to spatial data mining. Proc. 23rd Conference on Very\nLarge Databases, 186-195.\n\n[16] M. Kulldorff. 1998. Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. Am. J. Public\nHealth 88, 1377-1380.\n\n 3In a longer run on a different subject, where we iterate the scan statistic to pick out multiple\nsignificant regions, we found significant clusters in Broca's and Wernicke's areas in addition to the\nvisual cortex. This makes sense given the nature of the experimental task; however, more data is\nneeded before we can draw conclusive cross-subject comparisons.\n\n\f\n", "award": [], "sourceid": 2591, "authors": [{"given_name": "Daniel", "family_name": "Neill", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Francisco", "family_name": "Pereira", "institution": null}, {"given_name": "Tom", "family_name": "Mitchell", "institution": null}]}