Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Daniel Neill, Andrew Moore, Francisco Pereira, Tom M. Mitchell
Abstract

Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count c_i and a baseline b_i. Our goal is to find spatial regions (d-dimensional rectangles) where the c_i are significantly higher than expected given the b_i. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy.
1 Introduction
One of the core goals of modern statistical inference and data mining is to discover patterns and relationships in data. In many applications, however, it is important not only to discover patterns, but to distinguish those patterns that are significant from those that are likely to have occurred by chance. This is particularly important in epidemiological applications, where a rise in the number of disease cases in a region may or may not be indicative of an emerging epidemic. In order to decide whether further investigation is necessary, epidemiologists must know not only the location of a possible outbreak, but also some measure of the likelihood that an outbreak is occurring in that region. Similarly, when investigating brain imaging data, we want not only to find regions of increased activity, but also to determine whether these increases are significant or due to chance fluctuations.
More generally, we are interested in spatial data mining problems where the goal is detection of overdensities: spatial regions with high counts relative to some underlying baseline. In the epidemiological datasets, the count is some quantity (e.g. number of disease cases, or units of cough medication sold) in a given area, and the baseline is the expected value of that quantity based on historical data. In the brain imaging datasets, our count is the total fMRI activation in a given set of voxels under the experimental condition, while our baseline is the total activation in that set of voxels under the null or control condition.
We consider the case in which data has been aggregated to a uniform, d-dimensional grid. For the fMRI data, we have three spatial dimensions; for the epidemiological data, we have two spatial dimensions but also use several other quantities (time, patients' age and gender) as "pseudo-spatial" dimensions; this is discussed in more detail below.
In the general case, let G be a d-dimensional grid of cells, of size N_1 × N_2 × ... × N_d. Each cell s_i ∈ G (where i is a d-dimensional vector) is associated with a count c_i and a baseline b_i. Our goal is to search over all d-dimensional rectangular regions S ⊆ G, and find regions where the total count C(S) = Σ_{s_i ∈ S} c_i is higher than expected, given the baseline B(S) = Σ_{s_i ∈ S} b_i. In addition to discovering these high-density regions, we must also perform statistical testing to determine whether these regions are significant. As is necessary in the scan statistics framework, we focus on finding the single most significant region; the method can be iterated (removing each significant cluster once it is found) to find multiple significant regions.
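For concreteness, this setup can be sketched directly in Python/NumPy (an illustrative sketch, not code from the paper; the array names, the toy grid size, and the representation of a region as a tuple of (start, stop) index pairs are choices made here and are reused in the later sketches):

import numpy as np

# Toy 3-dimensional grid of counts c_i and baselines b_i (hypothetical values).
rng = np.random.default_rng(0)
shape = (16, 16, 16)                     # N_1 x N_2 x N_3
baseline = rng.uniform(1.0, 5.0, shape)  # b_i: expected counts
count = rng.poisson(baseline)            # c_i: observed counts (drawn at the baseline rate here)

def region_totals(count, baseline, region):
    """Total count C(S) and baseline B(S) of a rectangular region S,
    given as a tuple of (start, stop) index pairs, one per dimension."""
    sl = tuple(slice(lo, hi) for lo, hi in region)
    return count[sl].sum(), baseline[sl].sum()

C_S, B_S = region_totals(count, baseline, ((2, 6), (4, 10), (0, 8)))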
1.1 Likelihood ratio statistics

Our basic model assumes that counts c_i are generated by an inhomogeneous Poisson process with mean q b_i, where q (the underlying ratio of count to baseline) may vary spatially. We wish to detect hyper-rectangular regions S such that q is significantly higher inside S than outside S. To do so, for a given region S, we assume that q = q_in uniformly for cells s_i ∈ S, and q = q_out uniformly for cells s_i ∈ G − S. We then test the null hypothesis H_0(S): q_in ≤ (1+ε)q_out against the alternative hypothesis H_1(S): q_in > (1+ε)q_out. If ε = 0, this is equivalent to the classical spatial scan statistic [1-2]: we are testing for regions where q_in is greater than q_out. However, in many real-world applications (including the epidemiological and fMRI datasets discussed later) we expect some fluctuation in the underlying baseline; thus, we do not want to detect all deviations from baseline, but only those where the amount of deviation is greater than some threshold. For example, a 10% increase in disease cases in some region may not be interesting to epidemiologists, even if the underlying population is large enough to conclude that this is a "real" (statistically significant) increase in q. By increasing ε, we can focus the scan statistic on regions with larger ratios of count to baseline. For example, we can use the scan statistic with ε = 0.25 to test for regions where q_in is more than 25% higher than q_out. Following Kulldorff [1], our spatial scan statistic is the maximum, over all regions S, of the ratio of the likelihoods under the alternative and null hypotheses. Taking logs for convenience, we have:
D(S) = log [ sup_{q_in > (1+ε)q_out} ∏_{s_i ∈ S} P(c_i ~ Po(q_in b_i)) ∏_{s_i ∈ G−S} P(c_i ~ Po(q_out b_i)) ] / [ sup_{q_in ≤ (1+ε)q_out} ∏_{s_i ∈ S} P(c_i ~ Po(q_in b_i)) ∏_{s_i ∈ G−S} P(c_i ~ Po(q_out b_i)) ]

     = sgn × [ C(S) log( C(S) / ((1+ε)B(S)) ) + (C_tot − C(S)) log( (C_tot − C(S)) / (B_tot − B(S)) ) − C_tot log( C_tot / (B_tot + εB(S)) ) ]

where C(S) and B(S) are the count and baseline of the region S under consideration, C_tot and B_tot are the total count and baseline of the entire grid G, and sgn = +1 if C(S) > (1+ε) B(S) (C_tot − C(S)) / (B_tot − B(S)) and −1 otherwise. Then the scan statistic D_max is equal to the maximum of D(S) over all spatial regions (d-dimensional rectangles) under consideration. We note that our statistical and computational methods are not limited to the Poisson model given here; any model of null and alternative hypotheses such that the resulting statistic D(S) satisfies the conditions given in [4] can be used for the fast spatial scan.
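This statistic translates directly into code. The following is a minimal sketch under the Poisson model above (the function name and the handling of degenerate regions are choices made here, not taken from the paper):

import numpy as np

def score(C_S, B_S, C_tot, B_tot, eps=0.0):
    """The statistic D(S) above, for a region with count C_S and baseline B_S.

    Returns a negative value (sgn = -1) when the region's estimated rate does
    not exceed (1 + eps) times the outside rate, so that such regions never
    become the maximum; degenerate regions get -inf.
    """
    if C_S <= 0 or B_S <= 0 or C_S >= C_tot or B_S >= B_tot:
        return -np.inf
    D = (C_S * np.log(C_S / ((1 + eps) * B_S))
         + (C_tot - C_S) * np.log((C_tot - C_S) / (B_tot - B_S))
         - C_tot * np.log(C_tot / (B_tot + eps * B_S)))
    sgn = 1.0 if C_S > (1 + eps) * B_S * (C_tot - C_S) / (B_tot - B_S) else -1.0
    return sgn * D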
1.2 Randomization testing

Once we have found the highest scoring region S* = arg max_S D(S) of grid G, we must still determine the statistical significance of this region. Since the exact distribution of the test statistic D_max is only known in special cases, in general we must find the region's p-value by randomization. To do so, we run a large number R of random replications, where a replica G'
has the same underlying baselines b_i as G, but counts are randomly drawn from the null hypothesis H_0(S*). More precisely, we pick c_i ~ Po(q b_i), where q = q_in = (1+ε) C_tot / (B_tot + εB(S*)) for s_i ∈ S*, and q = q_out = C_tot / (B_tot + εB(S*)) for s_i ∈ G − S*. The number of replicas G' with D_max(G') > D_max(G), divided by the total number of replications R, gives us the p-value for our most significant region S*. If this p-value is less than α (where α is the false positive rate, typically chosen to be 0.05 or 0.1), we can conclude that the discovered region is statistically significant at level α.
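This procedure is straightforward to sketch as a Monte Carlo loop (illustrative only, not the paper's implementation; scan_fn stands for any exact scan that returns D_max and the maximizing region, e.g. the naive scan sketched in the next subsection or the fast scan of Section 2):

import numpy as np

def p_value_by_randomization(count, baseline, eps, scan_fn, R=999, seed=1):
    """Monte Carlo p-value for the most significant region of the observed grid.

    scan_fn(count, baseline, eps) must return (D_max, region) for a grid; any
    exact scan (naive or fast) can be plugged in.
    """
    rng = np.random.default_rng(seed)
    D_obs, region = scan_fn(count, baseline, eps)

    # Null rates inside/outside the detected region S*, as defined above.
    C_tot, B_tot = count.sum(), baseline.sum()
    sl = tuple(slice(lo, hi) for lo, hi in region)
    B_S = baseline[sl].sum()
    q = np.full(count.shape, C_tot / (B_tot + eps * B_S))   # q_out everywhere ...
    q[sl] *= (1 + eps)                                       # ... and q_in inside S*

    beats = 0
    for _ in range(R):
        replica = rng.poisson(q * baseline)                  # c_i ~ Po(q b_i)
        D_rep, _ = scan_fn(replica, baseline, eps)
        if D_rep > D_obs:                                    # replica "beats" the original grid
            beats += 1
    return beats / R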
1.3 The naive spatial scan

The simplest method of finding D_max is to compute D(S) for all rectangular regions of sizes k_1 × k_2 × ... × k_d, where 1 ≤ k_j ≤ N_j. Since there are ∏_{j=1}^{d} (N_j − k_j + 1) regions of each size, there are O(∏_{j=1}^{d} N_j^2) regions in total to examine. We can compute D(S) for any region S in constant time, by first finding the count C(S) and baseline B(S), then computing D.¹ This allows us to compute D_max of a grid G in O(∏_{j=1}^{d} N_j^2) time. However, significance testing by randomization also requires us to find D_max for each replica G', and compare this to D_max(G); thus the total complexity is multiplied by the number of replications R. When the size of the grid is large, as is the case for the epidemiological and fMRI datasets we are considering, this naive approach is computationally infeasible.

¹ An old trick makes it possible to compute the count and baseline of any rectangular region in time constant in N: we first form a d-dimensional array of the cumulative counts, then compute each region's count by adding/subtracting at most 2^d cumulative counts. Note that because of the exponential dependence on d, these techniques suffer from the "curse of dimensionality": neither the naive spatial scan nor the fast spatial scan discussed below is appropriate for very high-dimensional datasets.
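Both the cumulative-count trick of footnote 1 and the exhaustive enumeration can be sketched compactly (again an illustration rather than the paper's code; it reuses the score function sketched in Section 1.1 and is practical only for small grids):

import numpy as np
from itertools import product

def prefix_sums(a):
    """d-dimensional cumulative sums of a, padded with a leading zero plane
    in each axis so that rectangle sums need no boundary special-casing."""
    p = np.zeros(tuple(n + 1 for n in a.shape))
    p[tuple(slice(1, None) for _ in range(a.ndim))] = a
    for axis in range(a.ndim):
        p = np.cumsum(p, axis=axis)
    return p

def rect_sum(p, region):
    """Sum over a rectangle by adding/subtracting its 2^d corner values of
    the cumulative array (the 'old trick' of footnote 1)."""
    d = len(region)
    total = 0.0
    for corner in product((0, 1), repeat=d):
        idx = tuple(hi if bit else lo for bit, (lo, hi) in zip(corner, region))
        total += ((-1) ** (d - sum(corner))) * p[idx]
    return total

def naive_scan(count, baseline, eps=0.0):
    """Exhaustive scan over all d-dimensional rectangles: O(prod N_j^2)
    regions, so only feasible for small grids.  Uses score() from above."""
    C_tot, B_tot = count.sum(), baseline.sum()
    pc, pb = prefix_sums(count), prefix_sums(baseline)
    axis_ranges = [[(lo, hi) for lo in range(n) for hi in range(lo + 1, n + 1)]
                   for n in count.shape]
    best, best_region = -np.inf, None
    for region in product(*axis_ranges):
        D = score(rect_sum(pc, region), rect_sum(pb, region), C_tot, B_tot, eps)
        if D > best:
            best, best_region = D, region
    return best, best_region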
Instead, we apply our "overlap-multiresolution partitioning" algorithm [3-4], generalizing this method from two-dimensional to d-dimensional datasets. This reduces the complexity to O(∏_{j=1}^{d} N_j log N_j) in cases where the most significant region S* has a sufficiently high ratio of count to baseline, and (as we show in Section 3) typically results in tens to thousands of times speedup over the naive approach. We note that this fast spatial scan algorithm is exact (it always finds the correct value of D_max and the corresponding region S*); the speedup results from the observation that we do not need to search a given set of regions if we can prove that none of them have score > D_max. Thus we use a top-down, branch-and-bound approach: we maintain the current maximum score of the regions we have searched so far, calculate upper bounds on the scores of subregions contained in a given region, and prune regions whose upper bounds are less than the current value of D_max. When searching a replica grid, we care only whether D_max of the replica grid is greater than D_max(G). Thus we can use D_max of the original grid for pruning on the replicas, and can stop searching a replica if we find a region with score > D_max(G).
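The replica-level shortcut amounts to an early exit (a trivial sketch; region_scores stands for the scores in whatever order the search visits regions):

def replica_beats_original(region_scores, D_max_original):
    """While scanning a replica grid we only need to know whether *any* region
    scores above D_max of the original grid, so we stop at the first one."""
    return any(D > D_max_original for D in region_scores)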
2 Overlap-multiresolution partitioning
As in [4], we use a multiresolution search method which relies on an overlap-kd tree data structure. The overlap-kd tree, like kd-trees [5] and quadtrees [6], is a hierarchical, space-partitioning data structure. The root node of the tree represents the entire space under consideration (i.e. the entire grid G), and each other node represents a subregion of the grid. Each non-leaf node of a d-dimensional overlap-kd tree has 2d children, an "upper" and a "lower" child in each dimension. For example, in three dimensions, a node has six children: upper and lower children in the x, y, and z dimensions. The overlap-kd tree is different from the standard kd-tree and quadtree in that adjacent regions overlap: rather than splitting the region in half along each dimension, each child contains more than half the area of the parent region. For example, a 64 × 64 × 64 grid will have six children: two of size 48 × 64 × 64, two of size 64 × 48 × 64, and two of size 64 × 64 × 48.
In general, let region S have size k_1 × k_2 × ... × k_d. Then the two children of S in dimension j (for j = 1 ... d) have size k_1 × ... × k_{j−1} × f_j k_j × k_{j+1} × ... × k_d, where 1/2 < f_j < 1. This partitioning (for the two-dimensional case, where f_1 = f_2 = 3/4) is illustrated in Figure 1. Note that there is a region S_C common to all of these children; we call this region the center of S. When we partition region S in this manner, it can be proved that any subregion of S either a) is contained entirely in (at least) one of S_1 ... S_2d, or b) contains the center region S_C. Figure 1 illustrates each of these possibilities, for the simple case of d = 2.
Figure 1: Overlap-multires partitioning of region S (for d = 2), showing children S_1 ... S_4 and center S_C. Any subregion of S either a) is contained in some S_i, i = 1 ... 4, or b) contains S_C.
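As a concrete sketch of this partitioning (code written for illustration here, with a region represented as a tuple of (start, stop) index ranges and the fractions f_j passed in; the rule for choosing them is given below):

def partition(region, fractions):
    """The 2d overlapping children and the center S_C of a d-dimensional region.

    region is a tuple of (start, stop) index pairs; fractions[j] is f_j, with
    1/2 < f_j < 1 and f_j * k_j an integer (e.g. f_j = 3/4 when k_j = 2^m).
    """
    children, center = [], []
    for j, ((lo, hi), f) in enumerate(zip(region, fractions)):
        cut = round(f * (hi - lo))           # child size along dimension j
        # "Lower" and "upper" children keep the full extent in every other dimension.
        children.append(region[:j] + ((lo, lo + cut),) + region[j + 1:])
        children.append(region[:j] + ((hi - cut, hi),) + region[j + 1:])
        center.append((hi - cut, lo + cut))  # overlap of the two children in dimension j
    return children, tuple(center)

For the 64 × 64 × 64 example above, partition(((0, 64),)*3, (3/4,)*3) returns the six children of sizes 48 × 64 × 64, 64 × 48 × 64, and 64 × 64 × 48, plus the 32 × 32 × 32 center region ((16, 48), (16, 48), (16, 48)).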
Now we can search all subregions of S by recursively searching S_1 ... S_2d, then searching all of the regions contained in S which contain the center S_C. There may be a large number of such "outer regions," but since we know that each such region contains the center, we can place very tight bounds on the score of these regions, often allowing us to prune most or all of them. Thus the basic outline of our search procedure (ignoring pruning, for the moment) is:

overlap-search(S) {
    call base-case-search(S)
    define child regions S_1 .. S_2d and center S_C as above
    call overlap-search(S_i) for i = 1 .. 2d
    for all S' such that S' is contained in S and contains S_C,
        call base-case-search(S')
}
The fractions f_i are selected based on the current sizes k_i of the region being searched: if k_i = 2^m, then f_i = 3/4, and if k_i = 3 · 2^m, then f_i = 2/3. For simplicity, we assume that all N_i are powers of two, and thus all region sizes k_i will fall into one of these two cases. Repeating this partitioning recursively, we obtain the overlap-kd tree structure. For d = 2, the first two levels of the overlap-kd tree are shown in Figure 2.
Figure 2: The first two levels of the two-dimensional overlap-kd tree. Each node represents a gridded region (denoted by a thick rectangle) of the entire dataset (thin square and dots).
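The size rule translates directly into a small helper (a sketch; the function name and the error case are choices made here):

def fraction(k):
    """f_i for a region side of length k: 3/4 if k = 2^m, 2/3 if k = 3 * 2^m.

    Assumes the grid sides N_i are powers of two, so that every side length
    produced by the recursive partitioning falls into one of the two cases.
    """
    if (k & (k - 1)) == 0:                              # k is a power of two
        return 3 / 4
    if k % 3 == 0 and ((k // 3) & ((k // 3) - 1)) == 0:  # k = 3 * 2^m
        return 2 / 3
    raise ValueError("side length %d is not of the form 2^m or 3*2^m" % k)

For example, fraction(64) = 3/4 and fraction(48) = 2/3, matching the recursion described above.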
The overlap-kd tree has several useful properties, which we present here without proof. First, for every rectangular region S ⊆ G, either S is a gridded region (contained in the overlap-kd tree), or there exists a unique gridded region S' such that S is an outer region of S' (i.e. S is contained in S', and contains the center region of S'). This means that, if overlap-search is called exactly once for each gridded region², and no pruning is done, then base-case-search will be called exactly once for every rectangular region S ⊆ G. In practice, we will prune many regions, so base-case-search will be called at most once for every rectangular region, and every region will be either searched or pruned. The second nice property of our overlap-kd tree is that the total number of gridded regions is O(∏_{j=1}^{d} N_j log N_j). This implies that, if we are able to prune (almost) all outer regions, we can find D_max of the grid in O(∏_{j=1}^{d} N_j log N_j) time rather than O(∏_{j=1}^{d} N_j^2). In fact, we may not even need to search all gridded regions, so in many cases the search will be even faster.

² As in [4], we use "lazy expansion" to ensure that gridded regions are not multiply searched.
2.1 Score bounds and pruning

We now consider which regions can be pruned (discarded without searching) during our multiresolution search procedure. First, given some region S, we must calculate an upper bound on the scores D(S') for regions S' ⊆ S. More precisely, we are interested in two upper bounds: a bound on the score of all subregions S' ⊆ S, and a bound on the score of the outer subregions of S (those regions contained in S and containing its center S_C). If the first bound is less than or equal to D_max, we can prune region S completely; we do not need to search any (gridded or outer) subregion of S. If only the second bound is less than or equal to D_max, we do not need to search the outer subregions of S, but we must recursively call overlap-search on the gridded children of S. If both bounds are greater than D_max, we must both recursively call overlap-search and search the outer regions.
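This three-way decision can be stated compactly (a sketch; the two bound values come from the calculation described next):

def prune_decision(bound_all, bound_outer, D_max):
    """Decide how to handle region S given upper bounds on subregion scores.

    bound_all   -- upper bound on D(S') over all subregions S' of S
    bound_outer -- upper bound on D(S') over outer subregions (those containing S_C)
    Returns one of: "prune", "recurse_only", "recurse_and_search_outer".
    """
    if bound_all <= D_max:
        return "prune"                      # discard S entirely
    if bound_outer <= D_max:
        return "recurse_only"               # search gridded children only
    return "recurse_and_search_outer"       # children plus all outer regions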
Score bounds are calculated based on various pieces of information about the subregions of S, including: upper and lower bounds b_max, b_min on the baseline of subregions S'; an upper bound d_max on the ratio C/B of subregions S'; an upper bound d_inc on the ratio C/B of S' − S_C; and a lower bound d_min on the ratio C/B of S − S'. We also know the count C and baseline B of region S, and the count c_center and baseline b_center of region S_C. Let c_in and b_in be the count and baseline of S'. To find an upper bound on D(S'), we must calculate the values of c_in and b_in which maximize D subject to the given constraints: (c_in − c_center)/(b_in − b_center) ≤ d_inc, c_in/b_in ≤ d_max, (C − c_in)/(B − b_in) ≥ d_min, and b_min ≤ b_in ≤ b_max. The solution to this maximization problem is derived in [4], and (since scores are based only on count and baseline rather than the size and shape of the region) it applies directly to the multidimensional case. The bounds on baselines and ratios C/B are first calculated using global values (as a fast, "first-pass" pruning technique). For the remaining, unpruned regions, we calculate tighter bounds using the quartering method of [4], and use these to prune more regions.
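For intuition only, here is a numerical stand-in for that maximization (an illustration, not the method of [4]; it reuses the score function from Section 1.1's sketch, discretizes b_in, and takes c_in at its largest feasible value for each b_in; a real implementation should use the exact closed-form solution derived in [4], since a discretized search is not guaranteed to yield a true upper bound):

import numpy as np

def approx_score_bound(C, B, c_center, b_center, b_min, b_max,
                       d_max, d_inc, d_min, C_tot, B_tot, eps=0.0, steps=200):
    """Illustrative stand-in for the score bound derived in [4].

    For each candidate baseline b_in, take c_in at the largest value allowed
    by the three ratio constraints and evaluate the score.
    """
    best = -np.inf
    for b_in in np.linspace(b_min, b_max, steps):
        c_in = min(d_max * b_in,                          # c_in / b_in <= d_max
                   c_center + d_inc * (b_in - b_center),  # ratio of S' - S_C <= d_inc
                   C - d_min * (B - b_in))                # ratio of S - S' >= d_min
        if b_in <= 0 or c_in <= 0:
            continue
        best = max(best, score(c_in, b_in, C_tot, B_tot, eps))
    return best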
2.2 Related work

Our work builds most directly on the results of Kulldorff [1], who presents the two-dimensional spatial scan framework and the classical (ε = 0) likelihood ratio statistic. It also extends [4], in which we present the two-dimensional fast spatial scan. Our major extensions in the present work are twofold: the d-dimensional fast spatial scan, and the generalized likelihood ratio statistic D. A variety of other cluster detection techniques exist in the literature on epidemiology [1-3, 7-8], brain imaging [9-11], and machine learning [12-15]. The machine learning literature focuses on heuristic or approximate cluster-finding techniques, which typically cannot deal with spatially varying baselines, and more importantly, give no information about the statistical significance of the clusters found. Our technique is exact (in that it calculates the maximum of the likelihood ratio statistic over all hyper-rectangular spatial regions), and uses a powerful statistical test to determine significance. Nevertheless, other methods in the literature have some advantages over the present approach, such as applicability to high-dimensional data and fewer assumptions on the underlying model. The fMRI literature generally tests significance on a per-voxel basis (after applying some method of spatial smoothing); clusters must then be inferred by grouping individually significant voxels, and (with the exception of [10]) no per-cluster false positive rate is guaranteed. The epidemiological literature focuses on detecting significant circular, two-dimensional clusters, and thus cannot deal with multidimensional data or elongated regions. Detection of elongated regions is extremely important in both epidemiology (because of the need to detect windborne or waterborne pathogens) and brain imaging (because of the "folded sheet" structure of the brain); the present work, as well as [4], allows detection of such clusters.