{"title": "Tracking Dynamic Sources of Malicious Activity at Internet Scale", "book": "Advances in Neural Information Processing Systems", "page_first": 1946, "page_last": 1954, "abstract": "We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem by putting together a number of different \u201cexperts\u201d algorithms and online paging algorithms. We prove guarantees on our algorithm\u2019s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.", "full_text": "Tracking Dynamic Sources of Malicious Activity at\n\nInternet-Scale\n\nShobha Venkataraman\u2217, Avrim Blum\u2020, Dawn Song\u22c4, Subhabrata Sen\u2217, Oliver Spatscheck\u2217\n\n\u2217AT&T Labs \u2013 Research\n\n{shvenk,sen,spatsch}@research.att.com\n\n\u2020Carnegie Mellon University\n\n\u22c4University of California, Berkeley\n\navrim@cs.cmu.edu\n\ndawnsong@cs.berkeley.edu\n\nAbstract\n\nWe formulate and address the problem of discovering dynamic malicious regions\non the Internet. We model this problem as one of adaptively pruning a known\ndecision tree, but with additional challenges: (1) severe space requirements, since\nthe underlying decision tree has over 4 billion leaves, and (2) a changing target\nfunction, since malicious activity on the Internet is dynamic. We present a novel\nalgorithm that addresses this problem by putting together a number of different\n\u201cexperts\u201d algorithms and online paging algorithms. 
We prove guarantees on our\nalgorithm\u2019s performance as a function of the best possible pruning of a similar\nsize, and our experiments show that our algorithm achieves high accuracy on large\nreal-world data sets, with signi\ufb01cant improvements over existing approaches.\n\n1 Introduction\nIt is widely acknowledged that identifying the regions that originate malicious traf\ufb01c on the Internet\nis vital to network security and management, e.g., in throttling attack traf\ufb01c for fast mitigation, iso-\nlating infected sub-networks, and predicting future attacks [6, 18, 19, 24, 26]. In this paper, we show\nhow this problem can be modeled as a version of a question studied by Helmbold and Schapire [11]\nof adaptively learning a good pruning of a known decision tree, but with a number of additional chal-\nlenges and dif\ufb01culties. These include a changing target function and severe space requirements due\nto the enormity of the underlying IP address-space tree. We develop new algorithms able to address\nthese dif\ufb01culties that combine the underlying approach of [11] with the sleeping experts framework\nof [4, 10] and the online paging problem of [20]. We show how to deal with a number of practical\nissues that arise and demonstrate empirically on real-world datasets that this method substantially\nimproves over existing approaches of /24 pre\ufb01xes and network-aware clusters [6,19,24] in correctly\nidentifying malicious traf\ufb01c. Our experiments on data sets of 126 million IP addresses demonstrate\nthat our algorithm is able to achieve a clustering that is both highly accurate and meaningful.\n\n1.1 Background\nMultiple measurement studies have indicated that malicious traf\ufb01c tends to cluster in a way that\naligns with the structure of the IP address space, and that this is true for many different kinds of\nmalicious traf\ufb01c \u2013 spam, scanning, botnets, and phishing [6, 18, 19, 24]. 
Such clustered behaviour\ncan be easily explained: most malicious traf\ufb01c originates from hosts in poorly-managed networks,\nand networks are typically assigned contiguous blocks of the IP address space. Thus, it is natural\nthat malicious traf\ufb01c is clustered in parts of the IP address space that belong to poorly-managed\nnetworks.\nFrom a machine learning perspective, the problem of identifying regions of malicious activity can\nbe viewed as one of \ufb01nding a good pruning of a known decision tree \u2013 the IP address space may be\nnaturally interpreted as a binary tree (see Fig.1(a)), and the goal is to learn a pruning of this tree that\nis not too large and has low error in classifying IP addresses as malicious or non-malicious. The\nstructure of the IP address space suggests that there may well be a pruning with only a modest num-\nber of leaves that can classify most of the traf\ufb01c accurately. Thus, identifying regions of malicious\nactivity from an online stream of labeled data is much like the problem considered by Helmbold and\nSchapire [11] of adaptively learning a good pruning of a known decision tree. However, there are a\n\n1\n\n\fnumber of real-world challenges, both conceptual and practical, that must be addressed in order to\nmake this successful.\nOne major challenge in our application comes from the scale of the data and size of a complete\ndecision tree over the IP address space. A full decision tree over the IPv4 address space would\nhave 232 leaves, and over the IPv6 address space (which is slowly being rolled out), 2128 leaves.\nWith such large decision trees, it is critical to have algorithms that do not build the complete tree,\nbut instead operate in space comparable to the size of a good pruning. 
These space constraints are\nalso important because of the volume of traf\ufb01c that may need to be analyzed \u2013 ISPs often collect\nterabytes of data daily and an algorithm that needs to store all its data in memory simultaneously\nwould be infeasible.\n\nA second challenge comes from the fact that the regions of malicious activity may shift longitu-\ndinally over time [25]. This may happen for many reasons, e.g., administrators may eventually\ndiscover and clean up already infected bots, and attackers may target new vulnerabilities and attack\nnew hosts elsewhere. Such dynamic behaviour is a primary reason why individual IP addresses tend\nto be such poor indicators of future malicious traf\ufb01c [15, 26]. Thus, we cannot assume that the data\ncomes from a \ufb01xed distribution over the IP address space; the algorithm needs to adapt to dynamic\nnature of the malicious activity, and track these changes accurately and quickly. That is, we must\nconsider not only an online sequence of examples but also a changing target function.\nWhile there have been a number of measurement studies [6,18, 19,24] that have examined the origin\nof malicious traf\ufb01c from IP address blocks that are kept \ufb01xed apriori, none of these have focused on\ndeveloping online algorithms that \ufb01nd the best predictive IP address tree. Our challenge is to develop\nan ef\ufb01cient high-accuracy online algorithm that handles the severe space constraints inherent in this\nproblem and accounts for the dynamically changing nature of malicious behavior. We show that\nwe can indeed do this, both proving theoretical guarantees on adaptive regret and demonstrating\nsuccessful performance on real-world data.\n\n1.2 Contributions\nIn this paper, we formulate and address the problem of discovering and tracking malicious regions of\nthe IP address space from an online stream of data. 
We present an algorithm that adaptively prunes\nthe IP address tree in a way that maintains at most m leaves and performs nearly as well as the\noptimum adaptive pruning of the IP address tree with a comparable size. Intuitively, we achieve the\nrequired adaptivity and the space constraints by combining several \u201cexperts\u201d algorithms together\nwith a tree-based version of paging. Our theoretical results prove that our algorithm can predict\nnearly as well as the best adaptive decision tree with k leaves when using O(k log k) leaves.\nOur experimental results demonstrate that our algorithm identi\ufb01es malicious regions of the IP ad-\ndress space accurately, with orders of magnitude improvement over previous approaches. Our ex-\nperiments focus on classifying spammers and legitimate senders on two mail data sets, one with 126\nmillion messages collected over 38 days from the mail servers of a tier-1 ISP, and a second with\n28 million messages collected over 6 months from an enterprise mail server. Our experiments also\nhighlight the importance of allowing the IP address tree to be dynamic, and the resulting view of the\nIP address space that we get is both compelling and meaningful.\n\n2 De\ufb01nitions and Preliminaries\nWe now present some basic de\ufb01nitions as well as our formal problem statement.\n\nThe IP address hierarchy can be naturally interpreted as a full binary tree, as in Fig. 1: the leaves of\nthe tree correspond to individual IP addresses, and the non-leaf nodes correspond to the remaining\nIP pre\ufb01xes. Let P denote the set of all IP pre\ufb01xes, and I denote the set of all IP addresses. 
We also use the term clusters to denote the IP prefixes.

We define an IPtree T_P to be a pruning of the full IP address tree: a tree whose nodes are IP prefixes P ∈ P, and whose leaves are each associated with a label, i.e., malicious or non-malicious. An IPtree can thus be interpreted as a classification function for the IP addresses I: an IP address i gets the label associated with its longest matching prefix in P. Fig. 1 shows an example of an IPtree. We define the size of an IPtree to be the number of leaves it has. For example, in Fig. 1(a), the size of the IPtree is 6.

[Figure 1: IPTrees, example and real. (a) An example IPtree: a pruning of the binary IP address tree with leaf prefixes including 0.0.0.0/2, 128.0.0.0/4, 152.0.0.0/4, 160.0.0.0/3, and 192.0.0.0/2, each labeled + or -. (b) A real IPtree (colour coding explained in Sec. 5). Recall that an IP address is interpreted as a 32-bit string, read from left to right. This defines a path on the binary tree, going left for 0 and right for 1. An IP prefix is denoted by IP/n, where n indicates the number of bits relevant to the prefix.]

As described in Sec. 1, we focus on online learning in this paper. A typical point of comparison used in the online learning model is the error of the optimal offline fixed algorithm. In this case, the optimal offline fixed algorithm is the IPtree of a given size k, i.e., the tree of size k that makes the fewest mistakes on the entire sequence. However, if the true underlying IPtree may change over time, a better point of comparison would allow the offline tree to also change over time. To make such a comparison meaningful, the offline tree must pay an additional penalty each time it changes (otherwise the offline tree would not be a meaningful point of comparison – it could change for each IP address in the sequence, and thus make no mistakes). 
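As a concrete illustration of the classification rule above (an IP address receives the label of its longest matching prefix in the tree), here is a minimal sketch. This is our own illustrative code, not the paper's implementation; the node structure and helper names are assumptions:

```python
# A minimal sketch of an IPtree used as a classifier: an IP address
# receives the label of its longest matching prefix in the tree.

class Node:
    def __init__(self, label=None):
        self.label = label      # '+' (malicious) / '-' (non-malicious), or None
        self.children = {}      # bit (0 or 1) -> Node

def ip_bits(ip):
    """Interpret a dotted-quad IPv4 address as its 32-bit string, left to right."""
    n = 0
    for part in ip.split('.'):
        n = (n << 8) | int(part)
    return [(n >> (31 - i)) & 1 for i in range(32)]

def classify(root, ip):
    """Return the label of the longest matching prefix (root label as default)."""
    node, label = root, root.label
    for b in ip_bits(ip):
        if b not in node.children:
            break
        node = node.children[b]
        if node.label is not None:
            label = node.label
    return label

# Toy tree: everything '-' by default, but the prefix 128.0.0.0/1 is '+'.
root = Node('-')
root.children[1] = Node('+')
print(classify(root, '8.8.8.8'))      # first bit 0 -> '-'
print(classify(root, '192.0.2.1'))    # first bit 1 -> '+'
```

Walking the address bit by bit and remembering the last labeled node visited is exactly the longest-matching-prefix lookup; a production version would operate on the compressed prefix trie rather than one node per bit.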
We therefore limit the kinds of changes the\nof\ufb02ine tree can make, and compare the performance of our algorithm to every IPtree with k leaves,\nas a function of the errors it makes and the changes it makes.\nWe de\ufb01ne an adaptive IPtree of size k to be an adaptive tree that can (a) grow nodes over time so\nlong as it never has more than k leaves, (b) change the labels of its leaf nodes, and (c) occasionally\nrecon\ufb01gure itself completely. Our goal is to develop an online algorithm T such that for any se-\nquence of IP addresses, (1) for every adaptive tree T \u2032 of size k, the number of mistakes made by T\nis bounded by a (small) function of the mistakes and the changes of types (a), (b), and (c) made by\nT \u2032, and (2) T uses no more than \u02dcO(k) space. In the next section, we describe an algorithm meeting\nthese requirements.\n\n3 Algorithms and Analysis\nIn this section, we describe our main algorithm TrackIPTree, and present theoretical guarantees on\nits performance. At a high-level, our approach keeps a number of experts in each pre\ufb01x of the\nIPtree, and combines their predictions to classify every IP address. The inherent structure in the\nIPtree allows us to decompose the problem into a number of expert problems, and provide lower\nmemory bounds and better guarantees than earlier approaches.\nWe begin with an overview. De\ufb01ne the path-nodes of an IP address to be the set of all pre\ufb01xes of i\nin T , and denote this set by Pi,T . To predict the label of an IP i, the algorithm looks up all the path-\nnodes in Pi,T , considers their predictions, and combines these predictions to produce a \ufb01nal label\nfor i. 
To update the tree, the algorithm rewards the path-nodes that predicted correctly, penalizes the\nincorrect ones, and modi\ufb01es the tree structure if necessary.\nTo \ufb01ll out this overview, there are four technical questions that we need to address: (1) Of all the\npath-nodes in Pi,T , how do we learn the ones that are the most important? (2) How do we learn the\ncorrect label to predict at a particular path-node in Pi,T (i.e., positive or negative)? (3) How do we\ngrow the IPtree appropriately, ensuring that it grows primarily the pre\ufb01xes needed to improve the\nclassi\ufb01cation accuracy? (4) How do we ensure that the size of the IPtree stays bounded by m? We\naddress these questions by treating them as separate subproblems, and we show how they \ufb01t together\nto become the complete algorithm in Figure 3.1.\n\n3.1 Subproblems of TrackIPTree\nWe now describe our algorithm in detail. Since our algorithm decomposes naturally into the four\nsubproblems mentioned above, we focus on each subproblem separately to simplify the presentation.\nWe use the following notation in our descriptions: Recall from Sec. 2 that m is the maximum number\nof leaves allowed to our algorithm, k is the size of the optimal of\ufb02ine tree, and Pi,T denotes the set\nof path-nodes, i.e., the pre\ufb01xes of IP i in the current IPtree T .\nRelative Importance of the Path Nodes First, we consider the problem of deciding which of the\npre\ufb01x nodes in the path Pi,T is most important. We formulate this as a sleeping experts problem [4,\n10]. We set an expert in each node, and call them the path-node experts, and for an IP i, we consider\nthe set of path-node experts in Pi,T to be the \u201cawake\u201d experts, and the rest to be \u201casleep\u201d. 
[Figure 2: Decomposing the TrackIPTree Algorithm. (a) Sleeping experts for the relative importance of path-nodes: path-node experts x0, ..., x6 on the tree, with the shaded nodes "awake". (b) Shifting experts for determining node labels: each node has a positive expert y+ and a negative expert y-.]

The sleeping experts algorithm makes predictions using the awake experts, and intuitively, has the goal of predicting nearly as well as the best awake expert on the instance i.¹ In our context, the best awake expert on the IP i corresponds to the prefix of i in the optimal IPtree, which remains sleeping until the IPtree grows that prefix. Fig. 2(a) illustrates the sleeping experts framework in our context: the shaded nodes are "awake" and the rest are "asleep".

Specifically, let x_t denote the weight of the path-node expert at node t, and let S_{i,T} = Σ_{t ∈ P_{i,T}} x_t. To predict on IP address i, the algorithm chooses the expert at node t with probability x_t / S_{i,T}. To update, the algorithm penalizes all incorrect experts in P_{i,T}, reducing their weight x_t to γ·x_t (e.g., γ = 0.8). It then renormalizes the weights of all the experts in P_{i,T} so that their sum S_{i,T} does not change. (In our proof, we use a slightly different version of the sleeping experts algorithm [4].)

Deciding Labels of Individual Nodes  Next, we need to decide whether the path-node expert at a node n should predict positive or negative. We use a different experts algorithm to address this subproblem – the shifting experts algorithm [12]. Specifically, we allow each node n to have two additional experts – a positive expert, which always predicts positive, and a negative expert, which always predicts negative. We call these experts node-label experts. Let y_{n,+} and y_{n,-} denote the weights of the positive and negative node-label experts respectively, with y_{n,-} + y_{n,+} = 1. 
The algorithm operates as follows: to predict, the node predicts positive with probability y_{n,+} and negative with probability y_{n,-}. To update, when the node receives a label, it increases the weight of the correct node-label expert by ε, and decreases the weight of the incorrect node-label expert by ε (up to a maximum of 1 and a minimum of 0). Note that this algorithm naturally adapts when a leaf of the optimal IPtree switches labels – the relevant node in our IPtree will slowly shift weights from the incorrect node-label expert to the correct one, making an expected 1/ε mistakes in the process. Fig. 2(b) illustrates the shifting experts setting on an IPtree: each node has two experts, a positive and a negative. Fig. 3 shows how it fits in with the sleeping experts algorithm.

Building Tree Structure  We next address the subproblem of building the appropriate structure for the IPtree. The intuition here is: when a node in the IPtree makes many mistakes, then either that node has a subtree in the optimal IPtree that separates the positive and negative instances, or the optimal IPtree must also make the same mistakes. Since TrackIPTree cannot distinguish between these two situations, it simply splits any node that makes sufficient mistakes. In particular, TrackIPTree starts with only the root node, and tracks the number of mistakes made at every node. Every time a leaf makes 1/ε mistakes, TrackIPTree splits that leaf into its children, and instantiates and initializes the relevant path-node experts and node-label experts of the children. 
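A compact sketch of the per-example bookkeeping described in this section: the sleeping-experts penalty on path-node weights, the ε-shift of the node-label weights, and splitting a leaf after 1/ε mistakes. This is our own illustrative code, not the authors' implementation; the class, the dict-based tree, and the parameter values are assumptions:

```python
import random

EPS, GAMMA = 0.05, 0.8     # learning rate and penalty factor (illustrative values)

class PrefixNode:
    def __init__(self):
        self.x = 1.0           # path-node ("sleeping") expert weight
        self.y_pos = 0.5       # node-label expert: probability of predicting '+'
        self.mistakes = 0
        self.children = {}     # bit -> PrefixNode

def predict(path):
    """path: the PrefixNodes on the IP's prefix path (the 'awake' experts).
    Each node draws a label from its node-label experts; one node's label is
    then chosen with probability proportional to its path-node weight."""
    labels = ['+' if random.random() < n.y_pos else '-' for n in path]
    r = random.uniform(0, sum(n.x for n in path))
    for n, lab in zip(path, labels):
        r -= n.x
        if r <= 0:
            return lab, labels
    return labels[-1], labels

def update(path, labels, truth):
    total = sum(n.x for n in path)
    for n, lab in zip(path, labels):
        # shifting experts: move y_pos toward the true label by EPS, clipped to [0, 1]
        n.y_pos = min(1.0, max(0.0, n.y_pos + (EPS if truth == '+' else -EPS)))
        # sleeping experts: penalize nodes whose chosen label was wrong
        if lab != truth:
            n.x *= GAMMA
            n.mistakes += 1
            # a leaf that keeps erring is split into its two children
            if n.mistakes > 1 / EPS and not n.children:
                n.children = {0: PrefixNode(), 1: PrefixNode()}
    # renormalize so the total weight on this path is unchanged
    scale = total / sum(n.x for n in path)
    for n in path:
        n.x *= scale
```

Under these updates, a node that repeatedly mislabels its IPs both loses weight relative to the other prefixes on the path and, once it has erred more than 1/ε times as a leaf, is expanded so that finer prefixes can take over.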
In effect, it is as if the path-node experts of the children had been asleep till this point, but will now be "awake" for the appropriate IP addresses.

TrackIPTree waits for 1/ε mistakes at each node before growing it, so that there is a little resilience with noisy data – otherwise, it would split a node every time the optimal tree made a mistake, and the IPtree would grow very quickly. Note also that it naturally incorporates the optimal IPtree growing a leaf; our tree will grow the appropriate nodes when that leaf has made 1/ε mistakes.

Bounding Size of IPtree  Since TrackIPTree splits any node after it makes 1/ε mistakes, it is likely that the IPtree it builds is split much farther than the optimal IPtree – TrackIPTree does not know when to stop growing a subtree, and it splits even if the same mistakes are made by the optimal IPtree. While this excessive splitting does not impact the predictions of the path-node experts or the node-label experts significantly, we still need to ensure that the IPtree built by our algorithm does not become too large.

¹We leave the exact statement of the guarantee to the proof in [23].

TRACKIPTREE
Input: tree size m, learning rate ε, penalty factor γ
Initialize:
    Set T := root
    InitializeNode(root)
Prediction Rule: Given IP i
    // Select a node-label expert
    for n ∈ P_{i,T}
        flip coin of bias y_{n,+}
        if heads, predict[n] := +
        else predict[n] := -
    // Select a path-node expert
    rval := predict[n] with weight x_n / Σ_{t ∈ P_{i,T}} x_t
    Return rval
Update Rule: Given IP i, label r
    // Update node-label experts
    for n ∈ P_{i,T}
        for label z ∈ {+, -}
            if z = r, y_{n,z} := y_{n,z} + ε
            else y_{n,z} := y_{n,z} - ε
    // Update path-node experts
    s := Σ_{n ∈ P_{i,T}} x_n
    for n ∈ P_{i,T}
        if predict[n] ≠ r,
            penalize x_n := γ·x_n
            mistakes[n]++
            if mistakes[n] > 1/ε and n is leaf, GrowTree(n)
    Renormalize each x_n := x_n · s / Σ_{j ∈ P_{i,T}} x_j
sub INITIALIZENODE
Input: node t
    x_t := 1; y_{t,+} := y_{t,-} := 0.5
    mistakes[t] := 0
sub GROWTREE
Input: leaf l
    if size(T) ≥ m
        Select nodes N to discard with paging algorithm
    Split leaf l into children lc, rc
    InitializeNode(lc), InitializeNode(rc)

Figure 3: The Complete TrackIPTree Algorithm

We do this by framing it as a paging problem [20]: consider each node in the IPtree to be a page, and the maximum allowed nodes in the IPtree to be the size of the cache. The offline IPtree, which has k leaves, needs a cache of size 2k. The IPtree built by our algorithm may have at most m leaves (and thus 2m nodes, since it is a binary tree), and so the size of its cache is 2m and the offline cache is 2k. We may then select nodes to be discarded as if they were pages in the cache once the IPtree grows beyond 2m nodes; so, for example, we may choose the least recently used nodes in the IPtree, with LRU as the paging algorithm. Our analysis shows that setting m = O((k/ε²) log(k/ε)) suffices, when TrackIPTree uses FLUSH-WHEN-FULL (FWF) as its paging algorithm – this is a simple paging algorithm that discards all the pages in the cache when the cache is full, and restarts with an empty cache. We use FWF here for a clean analysis, and especially since in simple paging models, many algorithms achieve no better guarantees [20]. For our experiments, we implement LRU, and our results show that this approach, while perhaps not sophisticated, still maintains an accurate predictive IPtree.

3.2 Analysis
In this section, we present theoretical guarantees on TrackIPTree's performance. 
We show that our algorithm performs nearly as well as the best adaptive k-IPtree, bounding the number of mistakes made by our algorithm as a function of the number of mistakes, the number of label changes, and the number of complete reconfigurations of the optimal such tree in hindsight.

Theorem 3.1  Fix k. Set the maximum number of leaves allowed to the TrackIPTree algorithm to m = (10k/ε²) log(k/ε). Let T be an adaptive k-IPtree. Let M_{T,z} denote the number of mistakes T makes on the sequence z, Δ_{T,z} the number of times T changes the labels on its leaves over z, and R_{T,z} the number of times T has completely reconfigured itself over z. The algorithm TrackIPTree ensures that, on any sequence of instances z, for each T, the number of mistakes made by TrackIPTree is at most (1 + 3ε)·M_{T,z} + (1/ε + 3)·Δ_{T,z} + ((10k/ε³) log(k/ε))·(R_{T,z} + 1), with probability at least 1 - (1/k)^{k/(2ε²)}.

In other words, if there is an offline adaptive k-IPtree that makes few changes and few mistakes on the input sequence of IP addresses, then TrackIPTree will also make only a small number of mistakes. Due to space constraints, we present the proof in the technical report [23].

4 Evaluation Setup
We now describe our evaluation set-up: the data, practical changes to the algorithm, and the baseline schemes that we compare against. While there are many issues that go into converting the algorithm in Sec. 3 for practical use, we describe here those most important to our experiments, and defer the rest to the technical report [23].

Data  We focus on IP addresses derived from mail data, since spammers represent a significant fraction of the malicious activity and compromised hosts on the Internet [6], and labels are relatively easy to obtain from the spam-filtering run by the mail servers. For our evaluation, we consider labels from the mail servers' spam-filtering to be ground truth. 
Any errors in the spam-\ufb01ltering will in\ufb02u-\nence the tree that we construct and our experimental results are limited by this assumption.\nOne data set consists of log extracts collected at the mail servers of a tier-1 ISP with 1 million\nactive mailboxes. The extracts contain the IP addresses of the mail servers that send mail to the\nISP, the number of messages they sent, and the fraction of those messages that are classi\ufb01ed as\nspam, aggregated over 10 minute intervals. The mail server\u2019s spam-\ufb01ltering software consists of a\ncombination of hand-crafted rules, DNS blacklists, and Brightmail [1], and we take their results as\nlabels for our experiments. The log extracts were collected over 38 days from December 2008 to\nJanuary 2009, and contain 126 million IP addresses, of which 105 million are spam and 21 million\nare legitimate.\nThe second data set consists of log extracts from the enterprise mail server of a large corporation with\n1300 active mailboxes. These extracts also contain the IP addresses of mail servers that attempted to\nsend mail, along with the number of messages they sent and the fraction of these messages that were\nclassi\ufb01ed spam by SpamAssassin [2], aggregated over 10 minute intervals. The extracts contain 28\nmillion IP addresses, of which around 1.2 million are legitimate and the rest are spammers.\nNote that in both cases, our data only contains aggregate information about the IP addresses of the\nmail servers sending mail to the ISP and enterprise mail servers, and so we do not have the ability\nto map any information back to individual users of the ISP or enterprise mail servers.\nTrackIPTree For the experimental results, we use LRU as the paging algorithm when nodes need\nto be discarded from the IPtree (Sec. 3.1). In our implementation, we set TrackIPTree to discard\n1% of m, the maximum leaves allowed, every time it needs to expire nodes. 
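The LRU-based node expiry described above (discarding roughly 1% of the node budget whenever the tree is full) can be sketched as follows. This is a hypothetical stand-alone sketch with our own class and method names; `OrderedDict` stands in for the page cache over instantiated tree nodes:

```python
from collections import OrderedDict

class LRUNodeCache:
    """Treats instantiated IPtree nodes as pages: touching a prefix marks it
    recently used; when the budget is exceeded, the coldest nodes are evicted."""
    def __init__(self, max_nodes, evict_fraction=0.01):
        self.max_nodes = max_nodes
        self.evict_fraction = evict_fraction     # discard ~1% of the budget at a time
        self.nodes = OrderedDict()               # prefix -> node state

    def touch(self, prefix, node):
        """Record that `prefix` was used for a prediction or an update."""
        self.nodes[prefix] = node
        self.nodes.move_to_end(prefix)           # mark as most recently used

    def expire_if_full(self):
        """If the tree has reached its budget, evict the least recently used
        nodes and return their prefixes so the tree can drop those subtrees."""
        if len(self.nodes) < self.max_nodes:
            return []
        n_evict = max(1, int(self.evict_fraction * self.max_nodes))
        return [self.nodes.popitem(last=False)[0] for _ in range(n_evict)]
```

Evicting a small batch at a time, rather than flushing the whole cache as in the FWF analysis, keeps the learned weights of the hot prefixes intact between expiries.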
The learning rate ε is set to 0.05, and the penalty factor γ for the sleeping experts to 0.1. Our results are not affected if these parameters are changed by a factor of 2-3.

While we have presented an online learning algorithm, in practice it will often need to predict on data without receiving labels of the instances right away. Therefore, we study TrackIPTree's accuracy on the following day's data, i.e., to compute the prediction accuracy for day i, TrackIPTree is allowed to update until day i-1. We choose intervals of a day's length to allow the tree's predictions to be updated at least every day.

Apriori Fixed Clusters  We compare TrackIPTree to two sets of apriori fixed clusters: (1) network-aware clusters, which are a set of unique prefixes derived from BGP routing table snapshots [17], and (2) /24 prefixes. We choose these clusters as a baseline, as they have been the basis of measurement studies discussed earlier (Sec. 1) and prior work in IP-based classification [19, 24], and are even used by popular DNS blacklists [3].

We use the fixed clusters to predict the label of an IP in the usual manner: we simply assign an IP the label of its longest matching prefix among the clusters. Of course, we first need to assign these clusters their own labels. To ensure that they classify as well as possible, we assign them the optimal labeling over the data they need to classify; we do this by allowing them to make multiple passes over the data. That is, for each day, we assign labels so that the fixed clusters maximize their accuracy on spam for a given required accuracy on legitimate mail.² It is clear that this experimental set-up is favourable to the apriori fixed clusters.

We do not directly compare against the algorithm in [11], as it requires every unique IP address in the data set to be instantiated in the tree. 
In our experiments (e.g., with the ISP logs), this means that it requires over 90 million leaves in the tree. We instead focus on practical prior approaches with more cluster sizes in our experiments.

5 Results
We report three sets of experimental results regarding the prediction accuracy of TrackIPTree using the experimental set-up of Section 4. While we do not provide an extensive evaluation of our algorithm's computational efficiency, we note that our (unoptimized) implementation of TrackIPTree takes under a minute to learn over a million IP addresses, on a 2.4GHz Sparc64-VI core.

²For space reasons, we defer the details of how we assign this labeling to the technical report [23].

[Figure 4: Results for Experiments 1, 2, and 3. Six panels plot accuracy on spam IPs against coverage on legitimate IPs, or error against time in days: (a) Expt 1: ISP logs; (b) Expt 1: Enterprise logs; (c) Expt 2: ISP logs; (d) Expt 2: Enterprise logs; (e) Expt 3: Legitimate IPs; (f) Expt 3: Spam IPs. Panels (a)-(b) compare TrackIPTree, network-aware clusters, and /24 prefixes; panels (c)-(d) vary the maximum number of leaves m (20k-200k for the ISP logs, 1k-50k for the enterprise logs); panels (e)-(f) compare the dynamic tree against the 5-day and 10-day static trees.]

Our results compare 
the fraction of spamming IPs that the clusters classify correctly, subject to\nthe constraint that they classify at least x% legitimate mail IPs correctly (we term this to be the\ncoverage of the legitimate IPs required). Thus, we effectively plot the true positive rate against\nthe true negative rate. (This is just the ROC curve with the x-axis reversed, since we plot the true\npositive against the true negative, instead of plotting the true positive against the false positive.)\nExperiment 1: Comparisons with Apriori Fixed Clusters Our \ufb01rst set of experiments compares\nthe performance of our algorithm with network-aware clusters and /24 IP pre\ufb01xes. Figs. 4(a) & 4(b)\nillustrate the accuracy tradeoff of the three sets of clusters on the two data sets. Clearly, the accuracy\nof TrackIPTree is a tremendous improvement on both sets of apriori \ufb01xed clusters \u2013 for any choice\nof coverage on legitimate IPs, the accuracy of spam IPs by TrackIPTree is far higher than the apriori\n\ufb01xed clusters, even by as much as a factor of 2.5. In particular, note that when the coverage required\non legitimate IPs is 95%, TrackIPTree achieves 95% accuracy in classifying spam on both data sets,\ncompared to the 35 \u2212 45% achieved by the other clusters.\nIn addition, TrackIPTree gains this classi\ufb01cation accuracy using a far smaller tree. Table 1 shows\nthe median number of leaves instantiated by the tree at the end of each day. (To be fair to the \ufb01xed\nclusters, we only instantiate the pre\ufb01xes required to classify the day\u2019s data, rather than all possible\npre\ufb01xes in the clustering scheme.) Table 1 shows that the tree produced by TrackIPTree is a factor\nof 2.5-17 smaller with the ISP logs, and a factor of 20-100 smaller with the enterprise logs. 
These\nnumbers highlight that the apriori \ufb01xed clusters are perhaps too coarse to classify accurately in parts\nof the IP address space, and also are insuf\ufb01ciently aggregated in other parts of the address space.\nExperiment 2: Changing the Maximum Leaves Allowed Next, we explore the effect of changing\nm, the maximum number of leaves allowed to TrackIPTree. Fig. 4(c) & 4(d) show the accuracy-\ncoverage tradeoff for TrackIPTree when m ranges between 20,000-200,000 leaves for the ISP logs,\nand 1,000-50,000 leaves for the enterprise logs. Clearly, in both cases, the predictive accuracy\nincreases with m only until m is \u201csuf\ufb01ciently large\u201d \u2013 once m is large enough to capture all the\ndistinct subtrees in the underlying optimal IPtree, the predictive accuracy will not increase. While\nthe actual values of m are speci\ufb01c to our data sets, the results highlight the importance of having a\nspace-ef\ufb01cient and \ufb02exible algorithm \u2013 both 10,000 and 100,000 are very modest sizes compared to\nthe number of possible apriori \ufb01xed clusters, or the size of the IPv4 address space, and this suggests\nthat the underlying decision tree required is indeed of a modest size.\nExperiment 3: Does a Dynamic Tree Help? In this experiment, we demonstrate empirically that\nour algorithm\u2019s dynamic aspects do indeed signi\ufb01cantly enhance its accuracy over static clustering\nschemes. The static clustering that we compare to is a tree generated by our algorithm, but one that\nlearns over the \ufb01rst z days, and then stays unchanged. For ease of reference, we call such a tree a\nz-static tree; in our experiments, we set z = 5 and z = 10. 
We compare these trees by examining separately the errors incurred on legitimate and spam IPs.

Table 1: Sizes of Clustering Schemes

                    ISP        Enterprise
TrackIPTree         99942      9963
/24 Prefixes        1732441    1426445
Network-aware       260132     223025

Table 2: Colour coding for IPtree in Fig. 1(b)

w_t          Implication          Colour
≥ 0.2        Strongly Legit       Dark Green
[0, 0.2)     Weakly Legit         Light Green
(−0.2, 0)    Weakly Malicious     Blue
≤ −0.2       Strongly Malicious   White

Figs. 4(e) & 4(f) compare the errors of the z-static trees and the dynamic tree on legitimate and spam IPs respectively, using the ISP logs. Both z-static trees degrade in accuracy over time, and they do so on both legitimate and spam IPs; the accuracy of the dynamic tree, on the other hand, does not degrade over this period. Further, the gap in error grows with time: after 28 days, the 10-static tree has almost a factor of 2 higher error on both spam IPs and legitimate IPs.
Discussion and Implications Our experiments demonstrate that our algorithm achieves high accuracy in predicting legitimate and spam IPs, e.g., it can predict 95% of the spam IPs correctly while misclassifying only 5% of the legitimate IPs. However, it does not classify the IPs perfectly. This is unsurprising: achieving zero classification error in these applications is practically infeasible, given IP address dynamics [25]. Nevertheless, our IPtree still provides insight into the malicious activity on the Internet.
As an example, we examine a high-level view of the Internet obtained from our tree, and its implications. Fig. 1(b) visualizes an IPtree on the ISP logs with 50,000 leaves. It is laid out so that the root prefix is near the center, and the prefixes grow their children outwards.
The nodes are coloured depending on their weights, as shown in Table 2: for node t, define w_t = Σ_{j∈Q} x_j (y_{j,+} − y_{j,−}), where Q is the set of prefixes of node t (including node t itself). Thus, the blue central nodes are the large prefixes (e.g., /8 prefixes), and the classification they output is slightly malicious; this means that an IP address without a longer matching prefix in the tree is typically classified as malicious. This suggests, for example, that an unseen IP address is typically classified as a spammer by our IPtree, which is consistent with the observations of network administrators. A second observation is that the tree has many short branches as well as long branches, suggesting that some IP prefixes are grown to much greater depth than others. This might happen, for instance, if active IP addresses for this application are not distributed uniformly in the address space (so that not all prefixes need to be grown at uniform rates), which is also what we might expect to see based on prior work [16].

Of course, these observations are only examples; a complete analysis of our IPtree's implications is part of our future work. Nevertheless, these observations suggest that our tree does indeed capture an appropriate picture of the malicious activity on the Internet.

6 Other Related Work
In the networking and databases literature, there has been much interest in designing streaming algorithms to identify IP prefixes with significant network traffic [7, 9, 27], but these algorithms do not explore how to predict malicious activity. Previous IP-based approaches to reducing spam traffic [22, 24], as mentioned earlier, have also operated on individual IP addresses, which are not particularly useful since they are so dynamic [15, 19, 25].
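The node-weight and colour-coding rule used for the visualization above can be sketched as follows: `node_weight` sums x_j(y_{j,+} − y_{j,−}) over a node's prefix path, following the definition of w_t in the text, and `colour` applies the thresholds from Table 2. The list-of-triples representation of the path is our simplification for illustration, not the paper's data structure.

```python
# Sketch of the IPtree node-colouring rule: w_t sums x_j * (y_{j,+} - y_{j,-})
# over all prefixes j of node t (including t itself), and the colour is read
# off the thresholds in Table 2.

def node_weight(path):
    """path: (x_j, y_plus, y_minus) triples for each prefix of node t,
    from the root down to t itself."""
    return sum(x * (y_plus - y_minus) for x, y_plus, y_minus in path)

def colour(w):
    """Map a node weight to the colour coding of Table 2."""
    if w >= 0.2:
        return "dark green"   # strongly legitimate
    elif w >= 0.0:
        return "light green"  # weakly legitimate
    elif w > -0.2:
        return "blue"         # weakly malicious
    else:
        return "white"        # strongly malicious
```

For instance, a central prefix whose path carries a small negative weight, such as `colour(node_weight([(0.6, 0.4, 0.6)]))`, comes out blue, matching the "slightly malicious" large prefixes described above.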
Zhang et al. [26] also examine how to predict whether known malicious IP addresses may appear at a given network, by analyzing the co-occurrence of all known malicious IP addresses at a number of different networks. More closely related is [21], which presents algorithms to extract prefix-based filtering rules for IP addresses that may be used in offline settings. There has also been work on computing decision trees over streaming data [8, 13], but this work assumes that the data comes from a fixed distribution.

7 Conclusion
We have addressed the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with the additional challenges of real-world settings: severe space requirements and a changing target function. We developed new algorithms to address this problem by combining "experts" algorithms and online paging algorithms. We showed guarantees on our algorithm's performance as a function of the best possible pruning of a similar size, and our experimental results on real-world datasets show significant improvements over current approaches.
Acknowledgements We are grateful to Alan Glasser and Gang Yao for their help with the data analysis efforts.

References
[1] Brightmail. http://www.brightmail.com.
[2] SpamAssassin. http://www.spamassassin.apache.org.
[3] SpamHaus. http://www.spamhaus.net.
[4] BLUM, A., AND MANSOUR, Y. From external to internal regret. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT 2005) (2005).
[5] CESA-BIANCHI, N., FREUND, Y., HAUSSLER, D., HELMBOLD, D. P., SCHAPIRE, R. E., AND WARMUTH, M. K. How to use expert advice. J. ACM 44, 3 (1997), 427–485.
[6] COLLINS, M. P., SHIMEALL, T. J., FABER, S., JANIES, J., WEAVER, R., AND DE SHON, M. Using uncleanliness to predict future botnet addresses.
In Proceedings of the Internet Measurement Conference (2007).
[7] CORMODE, G., KORN, F., MUTHUKRISHNAN, S., AND SRIVASTAVA, D. Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (2004).
[8] DOMINGOS, P., AND HULTEN, G. Mining high-speed data streams. In Proceedings of ACM SIGKDD (2000), pp. 71–80.
[9] ESTAN, C., SAVAGE, S., AND VARGHESE, G. Automatically inferring patterns of resource consumption in network traffic. In Proceedings of SIGCOMM '03 (2003).
[10] FREUND, Y., SCHAPIRE, R. E., SINGER, Y., AND WARMUTH, M. K. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual Symposium on the Theory of Computing (STOC) (1997), pp. 334–343.
[11] HELMBOLD, D. P., AND SCHAPIRE, R. E. Predicting nearly as well as the best pruning of a decision tree. Machine Learning 27, 1 (1997), 51–68.
[12] HERBSTER, M., AND WARMUTH, M. Tracking the best expert. Machine Learning 32, 2 (August 1998).
[13] JIN, R., AND AGRAWAL, G. Efficient and effective decision tree construction on streaming data. In Proceedings of ACM SIGKDD (2003).
[14] JUNG, J., KRISHNAMURTHY, B., AND RABINOVICH, M. Flash crowds and denial of service attacks: Characterization and implications for CDNs and websites. In Proceedings of the International World Wide Web Conference (May 2002).
[15] JUNG, J., AND SIT, E. An empirical study of spam traffic and the use of DNS black lists. In Proceedings of the Internet Measurement Conference (IMC) (2004).
[16] KOHLER, E., LI, J., PAXSON, V., AND SHENKER, S. Observed structure of addresses in IP traffic. IEEE/ACM Transactions on Networking 14, 6 (2006).
[17] KRISHNAMURTHY, B., AND WANG, J. On network-aware clustering of web clients. In Proceedings of ACM SIGCOMM (2000).
[18] MAO, Z.
M., SEKAR, V., SPATSCHECK, O., VAN DER MERWE, J., AND VASUDEVAN, R. Analyzing large DDoS attacks using multiple data sources. In ACM SIGCOMM Workshop on Large Scale Attack Defense (2006).
[19] RAMACHANDRAN, A., AND FEAMSTER, N. Understanding the network-level behavior of spammers. In Proceedings of ACM SIGCOMM (2006).
[20] SLEATOR, D. D., AND TARJAN, R. E. Amortized efficiency of list update and paging rules. Communications of the ACM 28, 2 (1985), 202–208.
[21] SOLDO, F., MARKOPOULOU, A., AND ARGYRAKI, K. Optimal filtering of source address prefixes: Models and algorithms. In Proceedings of IEEE INFOCOM 2009 (2009).
[22] TWINING, D., WILLIAMSON, M. M., MOWBRAY, M., AND RAHMOUNI, M. Email prioritization: Reducing delays on legitimate mail caused by junk mail. In USENIX Annual Technical Conference (2004).
[23] VENKATARAMAN, S., BLUM, A., SONG, D., SEN, S., AND SPATSCHECK, O. Tracking dynamic sources of malicious activity at internet-scale. Tech. Rep. TD-7NZS8K, AT&T Labs, 2009.
[24] VENKATARAMAN, S., SEN, S., SPATSCHECK, O., HAFFNER, P., AND SONG, D. Exploiting network structure for proactive spam mitigation. In Proceedings of USENIX Security '07 (2007).
[25] XIE, Y., YU, F., ACHAN, K., GILLUM, E., GOLDSZMIDT, M., AND WOBBER, T. How dynamic are IP addresses? In Proceedings of ACM SIGCOMM (2007).
[26] ZHANG, J., PORRAS, P., AND ULLRICH, J. Highly predictive blacklists. In Proceedings of USENIX Security '08 (2008).
[27] ZHANG, Y., SINGH, S., SEN, S., DUFFIELD, N., AND LUND, C. Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In IMC '04: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA, 2004), ACM, pp.
101–114.