{"title": "In-Network PCA and Anomaly Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 624, "abstract": null, "full_text": "In-Network PCA and Anomaly Detection\n\nLing Huang\n\nUniversity of California\n\nBerkeley, CA 94720\nhling@cs.berkeley.edu\n\nXuanLong Nguyen\n\nUniversity of California\n\nBerkeley, CA 94720\n\nxuanlong@cs.berkeley.edu\n\nMinos Garofalakis\n\nIntel Research\n\nBerkeley, CA 94704\n\nminos.garofalakis@intel.com\n\nMichael I. Jordan\n\nUniversity of California\n\nBerkeley, CA 94720\njordan@cs.berkeley.edu\n\nAnthony Joseph\n\nUniversity of California\n\nBerkeley, CA 94720\nadj@cs.berkeley.edu\n\nNina Taft\n\nIntel Research\n\nBerkeley, CA 94704\nnina.taft@intel.com\n\nAbstract\n\nWe consider the problem of network anomaly detection in large distributed systems. In this\nsetting, Principal Component Analysis (PCA) has been proposed as a method for discovering\nanomalies by continuously tracking the projection of the data onto a residual subspace.\nThis method was shown to work well empirically in highly aggregated networks, that is,\nthose with a limited number of large nodes and at coarse time scales. This approach, however,\nhas scalability limitations. To overcome these limitations, we develop a PCA-based\nanomaly detector in which adaptive local data filters send to a coordinator just enough data\nto enable accurate global detection. Our method is based on a stochastic matrix perturbation\nanalysis that characterizes the tradeoff between the accuracy of anomaly detection and\nthe amount of data communicated over the network.\n\n1 Introduction\n\nThe area of distributed computing systems provides a promising domain for applications of machine\nlearning methods. 
One of the most interesting aspects of such applications is that learning algorithms\nthat are embedded in a distributed computing infrastructure are themselves part of that infrastructure\nand must respect its inherent local computing constraints (e.g., constraints on bandwidth, latency,\nand reliability), while attempting to aggregate information across the infrastructure so as to improve\nsystem performance (or availability) in a global sense.\nConsider, for example, the problem of detecting anomalies in a wide-area network. While it is\nstraightforward to embed learning algorithms at local nodes to attempt to detect node-level anomalies,\nthese anomalies may not be indicative of network-level problems. Indeed, in recent work, Lakhina et al. [8]\ndemonstrated a useful role for Principal Component Analysis (PCA) to detect network anomalies.\nThey showed that the minor components of PCA (the subspace obtained after removing the components\nwith largest eigenvalues) revealed anomalies that were not detectable in any single node-level\ntrace. This work assumed an environment in which all the data is continuously pushed to a central\nsite for off-line analysis. Such a solution scales neither to networks with a large number of\nmonitors nor to networks seeking to track and detect anomalies at very small time scales.\nDesigning scalable solutions presents several challenges. Viable solutions need to process data “in-network”\nto intelligently control the frequency and size of data communications. The key underlying\nproblem is that of developing a mathematical understanding of how to trade off quantization arising\nfrom local data filtering against fidelity of the detection analysis. We also need to understand how\nthis tradeoff impacts overall detection accuracy. 
Finally, the implementation needs to be simple if it\nis to have impact on developers.\nIn this paper, we present a simple algorithmic framework for network-wide anomaly detection that\nrelies on distributed tracking combined with approximate PCA analysis, together with supporting\ntheoretical analysis. In brief, the architecture involves a set of local monitors that maintain parameterized\nsliding filters. These sliding filters yield quantized data streams that are sent to a coordinator.\nThe coordinator makes global decisions based on these quantized data streams. We use stochastic\nmatrix perturbation theory both to assess the impact of quantization on the accuracy of anomaly\ndetection and to design a method that selects filter parameters in a way that bounds the detection\nerror. The combination of our theoretical tools and local filtering strategies results in an in-network\ntracking algorithm that can achieve high detection accuracy with low communication overhead; for\ninstance, our experiments show that, by choosing a relative eigen-error of 1.5% (yielding, approximately,\na 4% missed detection rate and a 6% false alarm rate), we can filter out more than 90% of\nthe traffic from the original signal.\nPrior Work. The original work on a PCA-based method by Lakhina et al. [8] has been extended\nby [17], who show how to infer network anomalies in both spatial and temporal domains. As with\n[8], this work is completely centralized. [14] and [1] propose PCA algorithms that are distributed\nacross blocks of rows or columns of the data matrix; however, these methods are not applicable to\nour case. Furthermore, neither [14] nor [1] addresses the issue of continuously tracking principal\ncomponents within a given error tolerance or the issue of implementing a communication/accuracy\ntradeoff, issues that are the main focus of our work. 
Other initiatives in distributed monitoring,\nprofiling and anomaly detection aim to share information and foster collaboration between widely\ndistributed monitoring boxes to offer improvements over isolated systems [12, 16]. Work in [2, 10]\nposits the need for scalable detection of network attacks and intrusions. In the setting of simpler\nstatistics such as sums and counts, in-network detection methods related to ours have been explored\nby [6]. Finally, recent work in the machine learning literature considers distributed constraints\nin learning algorithms such as kernel-based classification [11] and graphical model inference [7].\n(See [13] for a survey.)\n\n2 Problem description and background\n\nWe consider a monitoring system comprising a set of local monitor nodes M_1, ..., M_n, each of\nwhich collects a locally-observed time-series data stream (Fig. 1(a)). For instance, the monitors\nmay collect information on the number of TCP connection requests per second, the number of\nDNS transactions per minute, or the volume of traffic at port 80 per second. A central coordinator\nnode aims to continuously monitor the global collection of time series, and make global decisions\nsuch as those concerning matters of network-wide health. Although our methodology is generally\napplicable, in this paper we focus on the particular application of detecting volume anomalies. A\nvolume anomaly refers to unusual traffic load levels in a network that are caused by anomalies such\nas worms, distributed denial of service attacks, device failures, misconfigurations, and so on.\nEach monitor collects a new data point at every time step and, assuming a naive, “continuous push”\nprotocol, sends the new point to the coordinator. 
Based on these updates, the coordinator keeps track\nof a sliding time window of size m (i.e., the m most recent data points) for each monitor time series,\norganized into a matrix Y of size m × n (where the ith column Y_i captures the data from monitor\ni, see Fig. 1(a)). The coordinator then makes its decisions based solely on this (global) Y matrix.\nIn the network-wide volume anomaly detection algorithm of [8] the local monitors measure the total\nvolume of traffic (in bytes) on each network link, and periodically (e.g., every 5 minutes) centralize\nthe data by pushing all recent measurements to the coordinator. The coordinator then performs\nPCA on the assembled Y matrix to detect volume anomalies. This method has been shown to work\nremarkably well, presumably due to the inherently low-dimensional nature of the underlying data\n[9]. However, such a “periodic push” approach suffers from inherent limitations: To ensure fast\ndetection, the update periods should be relatively small; unfortunately, small periods also imply\nincreased monitoring communication overheads, which may very well be unnecessary (e.g., if there\nare no significant local changes across periods). Instead, in our work, we study how the monitors\ncan effectively filter their time-series updates, sending as little data as possible, yet enough so as\nto allow the coordinator to make global decisions accurately. 
We provide analytical bounds on the\nerrors that occur because decisions are made with incomplete data, and explore the tradeoff between\nreducing data transmissions (communication overhead) and decision accuracy.\n\n[Figure 1 panels: (a) The system setup; (b) Abilene network traffic data, showing the state vector (top) and the residual vector (bottom) for Monday through Sunday.]\n\nFigure 1: (a) The distributed monitoring system; (b) Data sample (‖y‖^2) collected over one week (top); its\nprojection in residual subspace (bottom). Dashed line represents a threshold for anomaly detection.\n\nUsing PCA for centralized volume anomaly detection. As observed by Lakhina et al. [8], due to\nthe high level of traffic aggregation on ISP backbone links, volume anomalies can often go unnoticed\nby being “buried” within normal traffic patterns (e.g., the circle dots shown in the top plot in\nFig. 1(b)). On the other hand, they observe that, although the measured data is of seemingly high\ndimensionality (n = number of links), normal traffic patterns actually lie in a very low-dimensional\nsubspace; furthermore, separating out this normal traffic subspace using PCA (to find the principal\ntraffic components) makes it much easier to identify volume anomalies in the remaining subspace\n(bottom plot of Fig. 
1(b)).\nAs before, let Y be the global m × n time-series data matrix, centered to have zero mean, and let\ny = y(t) denote an n-dimensional vector of measurements (for all links) from a single time step t.\nFormally, PCA is a projection method that maps a given set of data points onto principal components\nordered by the amount of data variance that they capture. The n principal components,\n{v_i}_{i=1}^n, are defined as:\n\n$$v_i = \\arg\\max_{\\|x\\|=1} \\left\\| \\left( Y - \\sum_{j=1}^{i-1} Y v_j v_j^T \\right) x \\right\\|$$\n\nand are the n eigenvectors of the estimated covariance matrix A := (1/m) Y^T Y. As shown in [9],\nPCA reveals that the Origin-Destination (OD) flow matrices of ISP backbones have low intrinsic\ndimensionality: For the Abilene network with 41 links, most data variance can be captured by the\nfirst k = 4 principal components. Thus, the underlying normal OD flows effectively reside in a\n(low) k-dimensional subspace of R^n. This subspace is referred to as the normal traffic subspace\nS_no. The remaining (n − k) principal components constitute the abnormal traffic subspace S_ab.\nDetecting volume anomalies relies on the decomposition of link traffic y = y(t) at any time step into\nnormal and abnormal components, y = y_no + y_ab, such that (a) y_no corresponds to modeled normal\ntraffic (the projection of y onto S_no), and (b) y_ab corresponds to residual traffic (the projection of y\nonto S_ab). Mathematically, y_no(t) and y_ab(t) can be computed as\n\n$$y_{no}(t) = P P^T y(t) = C_{no}\\, y(t) \\quad \\text{and} \\quad y_{ab}(t) = (I - P P^T)\\, y(t) = C_{ab}\\, y(t)$$\n\nwhere P = [v_1, v_2, ..., v_k] is formed by the first k principal components which capture the dominant\nvariance in the data. 
The matrix C_no = P P^T represents the linear operator that performs\nprojection onto the normal subspace S_no, and C_ab projects onto the abnormal subspace S_ab.\nAs observed in [8], a volume anomaly typically results in a large change to y_ab; thus, a useful metric\nfor detecting abnormal traffic patterns is the squared prediction error (SPE):\n\n$$\\mathrm{SPE} \\equiv \\|y_{ab}\\|^2 = \\|C_{ab}\\, y\\|^2$$\n\n(essentially, a quadratic residual function). More formally, their proposed algorithm signals a volume\nanomaly if SPE > Q_α, where Q_α denotes the threshold statistic for the SPE residual function\nat the 1 − α confidence level. Such a statistical test for the SPE residual function, known as the\nQ-statistic [4], can be computed as a function Q_α = Q_α(λ_{k+1}, ..., λ_n) of the (n − k) non-principal\neigenvalues of the covariance matrix A.\n\n[Figure 2 diagram labels: Distr. Monitors (Y_i(t), Filter/Predict, δ_i, R_i(t)); Coordinator (Input: ε, Perturbation Analysis, Subspace Method, Adaptive δ_1, ..., δ_n, Anomaly).]\n\nFigure 2: Our in-network tracking and detection framework.\n\n3 In-network PCA for anomaly detection\n\nWe now describe our version of an anomaly detector that uses distributed tracking and approximate\nPCA analysis. A key idea is to curtail the amount of data each monitor sends to the coordinator.\nBecause our job is to catch anomalies, rather than to track ongoing state, we point out that the\ncoordinator only needs to have a good approximation of the state when an anomaly is near. It need\nnot track global state very precisely when conditions are normal. This observation makes it intuitive\nthat a reduction in data sharing between monitors and the coordinator should be possible. 
We curtail\nthe amount of data flow from monitors to the coordinator by installing local filters at each monitor.\nThese filters maintain a local constraint, and a monitor only sends the coordinator an update of its\ndata when the constraint is violated. The coordinator thus receives an approximate, or “perturbed,”\nview of the data stream at each monitor and hence of the global state. We use stochastic matrix\nperturbation theory to analyze the effect on our PCA-based anomaly detector of using a perturbed\nglobal matrix. Based on this, we can choose the filtering parameters (i.e., the local constraints) so as\nto limit the effect of the perturbation on the PCA analysis and on any deterioration in the anomaly\ndetector\u2019s performance. All of these ideas are combined into a simple, adaptive distributed protocol.\n\n3.1 Overview of our approach\n\nFig. 2 illustrates the overall architecture of our system. We now describe the functionality at the\nmonitors and the coordinator. The goal of a monitor is to track its local raw time-series data, and to\ndecide when the coordinator needs an update. Intuitively, if the time series does not change much,\nor doesn\u2019t change in a way that affects the global condition being tracked, then the monitor does not\nsend anything to the coordinator. The coordinator assumes that the most recently received update\nis still approximately valid. The update message can be either the current value of the time series,\nor a summary of the most recent values, or any function of the time series. The update serves as a\nprediction of the future data, because should the monitor send nothing in subsequent time intervals,\nthen the coordinator uses the most recently received update to predict the missing values.\nFor our anomaly detection application, we filter as follows. 
Each monitor i maintains a filtering\nwindow F_i(t) of size 2δ_i centered at a value R_i(t) (i.e., F_i(t) = [R_i(t) − δ_i, R_i(t) + δ_i]). At each\ntime t, the monitor sends both Y_i(t) and R_i(t) to the coordinator only if Y_i(t) ∉ F_i(t); otherwise it\nsends nothing. The window parameter δ_i is called the slack; it captures the amount the time series\ncan drift before an update to the coordinator needs to be sent. The center parameter R_i(t) denotes\nthe approximate representation, or summary, of Y_i(t). In our implementation, we set R_i(t) equal\nto the average of the last five signal values observed locally at monitor i. Let t* denote the time of the\nmost recent update. The monitor needs to send both Y_i(t*) and R_i(t*) to the coordinator\nwhen it does an update, because the coordinator will use Y_i(t*) at time t* and R_i(t*) for all t > t*\nuntil the next update arrives. For any subsequent t > t* when the coordinator receives no update\nfrom that monitor, it will use R_i(t*) as the prediction for Y_i(t).\nThe role of the coordinator is twofold. First, it makes global anomaly-detection decisions based\nupon the received updates from the monitors. Second, it computes the filtering parameters (i.e., the\nslacks δ_i) for all the monitors based on its view of the global state and the condition for triggering an\nanomaly. It gives the monitors their slacks initially and updates the value of their slack parameters\nwhen needed. Our protocol is thus adaptive. Due to lack of space we do not discuss here the\nmethod for deciding when slack updates are needed. The global detection task is the same as in the\ncentralized scheme. In contrast to the centralized setting, however, the coordinator does not have\nan exact version of the raw data matrix Y; it has the approximation Ŷ instead. 
The PCA analysis,\nincluding the computation of S_ab, is done on the perturbed covariance matrix Â := A − Δ. The\nmagnitude of the perturbation matrix Δ is determined by the slack variables δ_i (i = 1, ..., n).\n\n3.2 Selection of filtering parameters\n\nA key ingredient of our framework is a practical method for choosing the slack parameters δ_i. This\nchoice is critical because these parameters balance the tradeoff between the savings in data communication\nand the loss of detection accuracy. Clearly, the larger the slack, the less the monitor needs\nto send, thus leading to both more reduction in communication overhead and potentially more information\nloss at the coordinator. We employ stochastic matrix perturbation theory to quantify the\neffects of the perturbation of a matrix on key quantities such as eigenvalues and the eigen-subspaces,\nwhich in turn affect the detection accuracy.\nOur approach is as follows. We measure the size of a perturbation using a norm on Δ. We derive\nan upper bound on the changes to the eigenvalues λ_i and the residual subspace C_ab as a function of\n‖Δ‖. We choose δ_i to ensure that an approximation to this upper bound on Δ is not exceeded. This\nin turn ensures that the changes to λ_i and C_ab do not exceed their upper bounds. Controlling these latter terms, we\nare able to bound the false alarm probability.\nRecall that the coordinator\u2019s view of the global data matrix is the perturbed matrix Ŷ = Y + W,\nwhere all elements of the column vector W_i are bounded within the interval [−δ_i, δ_i]. Let λ_i and\nλ̂_i (i = 1, ..., n) denote the eigenvalues of the covariance matrix A = (1/m) Y^T Y and its perturbed\nversion Â := (1/m) Ŷ^T Ŷ. Applying the classical theorems of Mirsky and Weyl [15], we obtain bounds\non the eigenvalue perturbation in terms of the Frobenius norm ‖·‖_F and the spectral norm ‖·‖_2 of\nΔ := A − Â, respectively:\n\n$$\\epsilon_{eig} := \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} (\\hat{\\lambda}_i - \\lambda_i)^2} \\le \\frac{\\|\\Delta\\|_F}{\\sqrt{n}} \\quad \\text{and} \\quad \\max_i |\\hat{\\lambda}_i - \\lambda_i| \\le \\|\\Delta\\|_2 \\qquad (1)$$\n\nApplying the sin theorem and results on bounding the angle of projections to subspaces [15] (see\n[3] for more details), we can bound the perturbation of the residual subspace C_ab in terms of the\nFrobenius norm of Δ:\n\n$$\\|C_{ab} - \\hat{C}_{ab}\\|_F \\le \\frac{\\sqrt{2}\\, \\|\\Delta\\|_F}{\\nu} \\qquad (2)$$\n\nwhere ν denotes the eigengap between the kth and (k+1)th eigenvalues of the estimated covariance\nmatrix Â.\nTo obtain a practical (i.e., computable) bound on the norms of Δ, we derive expectation bounds\ninstead of worst case bounds. We make the following assumptions on the error matrix W:\n\n1. The column vectors W_1, ..., W_n are independent and radially symmetric m-vectors.\n2. For each i = 1, ..., n, all elements of column vector W_i are i.i.d. random variables with\nmean 0, variance σ_i^2 := σ_i^2(δ_i) and fourth moment μ_i^4 := μ_i^4(δ_i).\n\nNote that the independence assumption is imposed only on the error; this by no means implies that\nthe signals received by different monitors are statistically independent. Under the above assumptions,\nwe can show that ‖Δ‖_F/√n is upper bounded in expectation by the following quantity:\n\n$$\\mathrm{Tol}_F = 2 \\sqrt{\\frac{1}{mn} \\sum_{i=1}^{n} \\lambda_i \\cdot \\sum_{i=1}^{n} \\sigma_i^2} + \\sqrt{\\left( \\frac{1}{m} + \\frac{1}{n} \\right) \\sum_{i=1}^{n} \\sigma_i^4 + \\frac{1}{mn} \\sum_{i=1}^{n} (\\mu_i^4 - \\sigma_i^4)} \\qquad (3)$$\n\nSimilar results can be obtained for the spectral norm as well. 
In practice, these upper bounds are\nvery tight because σ_1, ..., σ_n tend to be small compared to the top eigenvalues. Given the tolerable\nperturbation Tol_F, we can use Eqn. (3) to select the slack variables. For example, we can divide the\noverall tolerance across monitors either uniformly or in proportion to their observed local variance.\n\n3.3 Guarantee on false alarm probability\n\nBecause our approximation perturbs the eigenvalues, it also impacts the accuracy with which the\ntrigger is fired. Since the trigger condition is ‖C_ab y‖^2 > Q_α, we must assess the impact on both\nof these terms. We can compute an upper bound on the perturbation of the SPE statistic, SPE =\n‖C_ab y‖^2, as follows. First, note that\n\n$$\\bigl| \\|\\hat{C}_{ab} \\hat{y}\\| - \\|C_{ab} y\\| \\bigr| \\le \\|(\\hat{C}_{ab} - C_{ab}) \\hat{y}\\| + \\|C_{ab} (y - \\hat{y})\\| \\le \\frac{\\sqrt{2}\\, \\|\\Delta\\|_F \\|\\hat{y}\\|}{\\nu} + \\|C_{ab}\\|_2 \\sqrt{\\sum_{i=1}^{n} \\delta_i^2} \\le \\frac{\\sqrt{2}\\, \\|\\Delta\\|_F \\|\\hat{y}\\|}{\\nu} + \\left( \\|\\hat{C}_{ab}\\| + \\frac{\\sqrt{2}\\, \\|\\Delta\\|_F}{\\nu} \\right) \\sqrt{\\sum_{i=1}^{n} \\delta_i^2} =: \\eta_1(\\hat{y}).$$\n\nIt follows that\n\n$$\\bigl| \\|\\hat{C}_{ab} \\hat{y}\\|^2 - \\|C_{ab} y\\|^2 \\bigr| \\le \\eta_1(\\hat{y}) \\bigl( 2 \\|\\hat{C}_{ab} \\hat{y}\\| + \\eta_1(\\hat{y}) \\bigr) =: \\eta_2(\\hat{y}). \\qquad (4)$$\n\nThe dependency of the threshold Q_α on the eigenvalues, λ_{k+1}, ..., λ_n, can be expressed as [4]:\n\n$$Q_\\alpha = \\phi_1 \\left[ \\frac{c_\\alpha \\sqrt{2 \\phi_2 h_0^2}}{\\phi_1} + 1 + \\frac{\\phi_2 h_0 (h_0 - 1)}{\\phi_1^2} \\right]^{1/h_0}, \\qquad (5)$$\n\nwhere c_α is the (1 − α)-percentile of the standard normal distribution, h_0 = 1 − 2φ_1φ_3/(3φ_2^2), and φ_i =\nΣ_{j=k+1}^n λ_j^i for i = 1, 2, 3.\nTo assess the perturbation in false alarm probability, we start by considering the following random\nvariable c derived from Eqn. (5):\n\n$$c = \\frac{\\phi_1 \\left[ (\\mathrm{SPE}/\\phi_1)^{h_0} - 1 - \\phi_2 h_0 (h_0 - 1)/\\phi_1^2 \\right]}{\\sqrt{2 \\phi_2 h_0^2}}. \\qquad (6)$$\n\nThe random variable c essentially normalizes the random quantity ‖C_ab y‖^2 and is known to approximately\nfollow a standard normal distribution [5]. The false alarm probability in the centralized\nsystem is expressed as\n\n$$\\Pr\\left[ \\|C_{ab} y\\|^2 > Q_\\alpha \\right] = \\Pr[c > c_\\alpha] = \\alpha,$$\n\nwhere the lefthand term of this equation is conditioned upon the SPE statistics being inside the\nnormal range. In our distributed setting, the anomaly detector fires a trigger if ‖Ĉ_ab ŷ‖^2 > Q̂_α.\nWe thus only observe a perturbed version ĉ for the random variable c. Let η_c denote the bound on\n|ĉ − c|. The deviation of the false alarm probability in our approximate detection scheme can then\nbe approximated as P(c_α − η_c < U < c_α + η_c), where U is a standard normal random variable.\n\n4 Evaluation\n\nWe implemented our algorithm and developed a trace-driven simulator to validate our methods. We\nused a one-week trace collected from the Abilene network (footnote 1). The trace contains per-link traffic\nloads measured every 10 minutes, for all 41 links of the Abilene network. With a time unit of 10\nminutes, data was collected for 1008 time units. This data was used to feed the simulator. There\nare 7 anomalies in the data that were detected by the centralized algorithm (and verified by hand\nto be true anomalies). We also injected 70 synthetic anomalies into this dataset using the method\ndescribed in [8], so that we would have sufficient data to compute error rates. We used a threshold\nQ_α corresponding to a 1 − α = 99.5% confidence level. 
Due to space limitations, we present\nresults only for the case of uniform monitor slack, δ_i = δ.\n\n[Figure 3 panels, each plotted against the relative eigen-error (0 to 0.03): (a) Slack; (b) Rel. Eigen Error; (c) Rel. Threshold Error; (d) False Alarm Rate; (e) Missed Detec. Rate; (f) Comm. Overhead. Legend: Upper Bound, Actual Accrued.]\n\nFigure 3: In all plots the x-axis is the relative eigen-error. (a) The filtering slack. (b) Actual accrued eigen-error.\n(c) Relative error of detection threshold. (d) False alarm rates. (e) Missed detection rates. (f) Communication\noverhead.\n\nFootnote 1: Abilene is an Internet2 high-performance backbone network that interconnects a large number of universities\nas well as a few other research institutes.\n\nThe input parameter for our algorithm is the tolerable relative error of the eigenvalues (“relative\neigen-error” for short), which acts as a tuning knob. (Precisely, it is Tol_F / √((1/n) Σ_i λ_i^2), where Tol_F\nis defined in Eqn. (3).) Given this parameter and the input data we can compute the filtering slack δ\nfor the monitors using Eqn. (3). We then feed in the data to run our protocol in the simulator with the\ncomputed δ. 
The simulator outputs a set of results including: 1) the actual relative eigen errors and\nthe relative errors on the detection threshold Q_α; 2) the missed detection rate, false alarm rate and\ncommunication cost achieved by our method. The missed-detection rate is defined as the fraction of\nmissed detections over the total number of real anomalies, and the false-alarm rate as the fraction\nof false alarms over the total number of anomalies detected by our protocol, which is α (defined in\nSec. 3.3) rescaled as a rate rather than a probability. The communication cost is computed as the\nfraction of messages that actually get through the filtering window to the coordinator.\nThe results are shown in Fig. 3. In all plots, the x-axis is the relative eigen-error. In Fig. 3(a) we plot\nthe relationship between the relative eigen-error and the filtering slack δ when assuming filtering\nerrors are uniformly distributed on interval [−δ, δ]. With this model, the relationship between the\nrelative eigen-error and the slack is determined by a simplified version of Eqn. (3) (with all σ_i^2 = δ^2/3).\nThe results make intuitive sense. As we increase our error tolerance, we can filter more at the monitor\nand send less to the coordinator. The slack increases almost linearly with the relative eigen-error\nbecause the first term in the right hand side of Eqn. (3) dominates all other terms.\nIn Fig. 3(b) we compare the relative eigen-error to the actual accrued relative eigen-error (defined as\nε_eig / √((1/n) Σ_i λ_i^2), where ε_eig is defined in Eqn. (1)). These were computed using the slack parameters\nδ as computed by our coordinator. We can see that the real accrued eigen-errors are always less than\nthe tolerable eigen errors. The plot shows a tight upper bound, indicating that it is safe to use our\nmodel\u2019s derived filtering slack δ. In other words, the achieved eigen-error always remains below the\nrequested tolerable error specified as input, and the slack chosen given the tolerable error is close\nto being optimal. Fig. 3(c) shows the relationship between the relative eigen-error and the relative\nerror of detection threshold Q_α (precisely, 1 − Q̂_α/Q_α, where Q̂_α is computed from λ̂_{k+1}, ..., λ̂_n).\nWe see that the threshold for detecting anomalies decreases as we\ntolerate more and more eigen-errors. In these experiments, an error of 2% in the eigenvalues leads\nto an error of approximately 6% in our estimate of the appropriate cutoff threshold.\nWe now examine the false alarm rates achieved. In Fig. 3(d) the curve with triangles represents\nthe upper bound on the false alarm rate as estimated by the coordinator. The curve with circles\nis the actual accrued false alarm rate achieved by our scheme. Note that the upper bound on the\nfalse alarm rate is fairly close to the true values, especially when the slack is small. The false alarm\nrate increases with increasing eigen-error because as the eigen-error increases, the corresponding\ndetection threshold Q_α will decrease, which in turn causes the protocol to raise an alarm more\noften. (If we had plotted Q̂ rather than the relative threshold difference, we would obviously see a\ndecreasing Q̂ with increasing eigen-error.) We see in Fig. 3(e) that the missed detection rates remain\nbelow 4% for various levels of communication overhead.\nThe communication overhead is depicted in Fig. 3(f). Clearly, the larger the errors we can tolerate,\nthe more overhead can be reduced. Considering these last three plots (d,e,f) together, we observe\nseveral tradeoffs. For example, when the relative eigen-error is 1.5%, our algorithm reduces the data\nsent through the network by more than 90%. 
This gain is achieved at the cost of approximately a\n4% missed detection rate and a 6% false alarm rate. This is a large reduction in communication for\na small increase in detection error. These initial results illustrate that our in-network solution can\ndramatically lower the communication overhead while still achieving high detection accuracy.\n\n5 Conclusion\n\nWe have presented a new algorithmic framework for network anomaly detection that combines distributed\ntracking with PCA analysis to detect anomalies with far less data than previous methods.\nThe distributed tracking consists of local filters, installed at each monitoring site, whose parameters\nare selected based upon global criteria. The idea is to track the local monitoring data only enough so\nas to enable accurate detection. The local filtering reduces the amount of data transmitted through\nthe network but also means that anomaly detection must be done with limited or partial views of the\nglobal state. Using methods from stochastic matrix perturbation theory, we provided an analysis for\nthe tradeoff between the detection accuracy and the data communication overhead. We were able\nto control the amount of data overhead using the relative eigen-error as a tuning knob. To the\nbest of our knowledge, this is the first result in the literature that provides upper bounds on the false\nalarm rate of network anomaly detection.\n\nReferences\n\n[1] BAI, Z.-J., CHAN, R. AND LUK, F. Principal component analysis for distributed data sets with updating.\nIn Proceedings of International workshop on Advanced Parallel Processing Technologies (APPT), 2005.\n\n[2] DREGER, H., FELDMANN, A., PAXSON, V. AND SOMMER, R. Operational experiences with high-volume\nnetwork intrusion detection. In Proceedings of ACM Conference on Computer and Communications\nSecurity (CCS), 2004.\n\n[3] HUANG, L., NGUYEN, X., GAROFALAKIS, M., JORDAN, M., JOSEPH, A. AND TAFT, N. In-network\nPCA and anomaly detection. 
Technical Report No. UCB/EECS-2007-10, EECS Department, UC Berkeley.\n\n[4] JACKSON, J. E. AND MUDHOLKAR, G. S. Control procedures for residuals associated with principal component analysis. In Technometrics, 21(3):341-349, 1979.\n\n[5] JENSEN, D. R. AND SOLOMON, H. A Gaussian approximation for the distribution of definite quadratic forms. In Journal of the American Statistical Association, 67(340):898-902, 1972.\n\n[6] KERALAPURA, R., CORMODE, G. AND RAMAMIRTHAM, J. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 2006.\n\n[7] KREIDL, P. O. AND WILLSKY, A. Inference with minimal communication: A decision-theoretic variational approach. In Proceedings of Neural Information Processing Systems (NIPS), 2006.\n\n[8] LAKHINA, A., CROVELLA, M. AND DIOT, C. Diagnosing network-wide traffic anomalies. In Proceedings of ACM Conference of the Special Interest Group on Data Communication (SIGCOMM), 2004.\n\n[9] LAKHINA, A., PAPAGIANNAKI, K., CROVELLA, M., DIOT, C., KOLACZYK, E. D. AND TAFT, N. Structural analysis of network traffic flows. In Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2004.\n\n[10] LEVCHENKO, K., PATURI, R. AND VARGHESE, G. On the difficulty of scalably detecting network attacks. In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2004.\n\n[11] NGUYEN, X., WAINWRIGHT, M. AND JORDAN, M. Nonparametric decentralized detection using kernel methods. In IEEE Transactions on Signal Processing, 53(11):4053-4066, 2005.\n\n[12] PADMANABHAN, V. N., RAMABHADRAN, S., AND PADHYE, J. Netprofiler: Profiling wide-area networks using peer cooperation. In Proceedings of International Workshop on Peer-to-Peer Systems, 2005.\n\n[13] PREDD, J.B., KULKARNI, S.B., AND POOR, H.V. Distributed learning in wireless sensor networks. 
In IEEE Signal Processing Magazine, 23(4):56-69, 2006.\n\n[14] QU, Y., OSTROUCHOV, G., SAMATOVA, N. AND GEIST, A. Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of IEEE International Conference on Data Mining (ICDM), 2002.\n\n[15] STEWART, G. W., AND SUN, J.-G. Matrix Perturbation Theory. Academic Press, 1990.\n\n[16] YEGNESWARAN, V., BARFORD, P., AND JHA, S. Global intrusion detection in the domino overlay system. In Proceedings of Network and Distributed System Security Symposium (NDSS), 2004.\n\n[17] ZHANG, Y., GE, Z.-H., GREENBERG, A., AND ROUGHAN, M. Network anomography. In Proceedings of Internet Measurement Conference (IMC), 2005.\n", "award": [], "sourceid": 3156, "authors": [{"given_name": "Ling", "family_name": "Huang", "institution": null}, {"given_name": "XuanLong", "family_name": "Nguyen", "institution": null}, {"given_name": "Minos", "family_name": "Garofalakis", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Anthony", "family_name": "Joseph", "institution": null}, {"given_name": "Nina", "family_name": "Taft", "institution": null}]}