{"title": "Throttling Poisson Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 513, "abstract": "We study a setting in which Poisson processes generate sequences of decision-making events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service.", "full_text": "Throttling Poisson Processes\n\nUwe Dick\n\nPeter Haider\n\nThomas Vanck\n\nMichael Br\u00fcckner\n\nTobias Scheffer\n\nUniversity of Potsdam\n\nDepartment of Computer Science\n\n{uwedick,haider,vanck,mibrueck,scheffer}@cs.uni-potsdam.de\n\nAugust-Bebel-Strasse 89, 14482 Potsdam, Germany\n\nAbstract\n\nWe study a setting in which Poisson processes generate sequences of decision-making events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service.\n\n1 Introduction\n\nThis paper studies a family of decision-making problems in which discrete events occur on a continuous time scale.
The time intervals between events are governed by a Poisson process. Each event has to be met by a decision to either suppress or allow it. The optimization criterion is allowed to depend on the rate of decision outcomes within a time interval; the criterion is not necessarily a sum of a loss function over individual decisions.\n\nThe problems that we study cannot adequately be modeled as Markov or semi-Markov decision problems because the probability of transitioning from any value of decision rates to any other value depends on the exact points in time at which each event occurred in the past. Encoding the entire backlog of time stamps in the state of a Markov process would lead to an unwieldy formalism. The learning formalism which we explore in this paper models the problem directly as a Poisson process with a throttling policy that depends on an explicit data-dependent rate limit, which allows us to refer to a result from queuing theory and derive a convex optimization problem that can be solved efficiently.\n\nConsider the following two scenarios as motivating applications. In order to stage a successful denial-of-service attack, an assailant has to post requests at a rate that exceeds the capacity of the service. A prevention system has to meet each request by a decision to suppress it, or allow it to be processed by the service provider. Suppressing legitimate requests runs up costs. Passing few abusive requests to be processed runs up virtually no costs. Only when the rate of passed abusive requests exceeds a certain capacity does the service become unavailable and costs are incurred. The following second application scenario will serve as a running example throughout this paper. Any email service provider has to deal with a certain fraction of accounts that are set up to disseminate phishing messages and email spam. Serving the occasional spam message causes no harm other than consuming computational resources.
But if the rate of spam messages that an outbound email server discharges triggers alerting mechanisms of other providers, then that outbound server will become blacklisted and the service is disrupted. Naturally, suppressing any legitimate message is a disruption to the service, too.\n\nLet x denote a sequence of decision events x_1, ..., x_n; each event is a point x_i \u2208 X in an instance space. Sequence t denotes the time stamps t_i \u2208 R_+ of the decision events with t_i < t_{i+1}. We define an episode e by the tuple e = (x, t, y) which includes a label y \u2208 {\u22121, +1}. In our application, an episode corresponds to the sequence of emails sent within an observation interval from a legitimate (y = \u22121) or abusive (y = +1) account e. We write x_i and t_i to denote the initial sequence of the first i elements of x and t, respectively. Note that the length n of the sequences can be different for different episodes.\n\nLet A = {\u22121, +1} be a binary decision set, where +1 corresponds to suppressing an event and \u22121 corresponds to passing it. The decision model \u03c0 gets to make a decision \u03c0(x_i, t_i) \u2208 A at each point in time t_i at which an event occurs.\n\nThe outbound rate r_\u03c0(t\u2032 | x, t) at time t\u2032 for episode e and decision model \u03c0 is a crucial concept. It counts the number of events that were let pass during a time interval of length \u03c4 ending before t\u2032. It is therefore defined as\n\nr_\u03c0(t\u2032 | x, t) = |{i : \u03c0(x_i, t_i) = \u22121 \u2227 t_i \u2208 [t\u2032 \u2212 \u03c4, t\u2032)}|.\n\nIn outbound spam throttling, \u03c4 corresponds to the time interval that is used by other providers to estimate the incoming spam rate.\n\nWe define an immediate loss function \u2113 : Y \u00d7 A \u2192 R_+ that specifies the immediate loss of deciding a \u2208 A for an event with label y \u2208 Y as\n\n\u2113(y, a) = c_+ if y = +1 \u2227 a = \u22121,\n\u2113(y, a) = c_\u2212 if y = \u22121 \u2227 a = +1,\n\u2113(y, a) = 0 otherwise,    (1)\n\nwhere c_+ and c_\u2212 are positive constants, corresponding to the costs of false negative and false positive decisions, respectively. Additionally, the rate-based loss \u03bb : Y \u00d7 R_+ \u2192 R_+ is the loss that runs up per unit of time. We require \u03bb to be a convex, monotonically increasing function in the outbound rate for y = +1 and to be 0 otherwise. The rate-based loss reflects the risk of the service getting blacklisted based on the current sending behaviour. This risk grows in the rate of spam messages discharged and the duration over which a high sending rate of spam messages is maintained.\n\nThe total loss of a model \u03c0 for an episode e = (x, t, y) is therefore defined as\n\nL(\u03c0; x, t, y) = \u222b_{t_1}^{t_n+\u03c4} \u03bb(y, r_\u03c0(t\u2032 | x, t)) dt\u2032 + \u2211_{i=1}^{n} \u2113(y, \u03c0(x_i, t_i))    (2)\n\nThe first term penalizes a high rate of unsuppressed events with label +1 (in our example, a high rate of unsuppressed spam messages), whereas the second term penalizes each decision individually. For the special case of \u03bb = 0, the optimization criterion resolves to a risk, and the problem becomes a standard binary classification problem.\n\nAn unknown target distribution p(x, t, y) induces the overall optimization goal E_{x,t,y}[L(\u03c0; x, t, y)]. The learning problem consists in finding \u03c0^\u2217 = argmin_\u03c0 E_{x,t,y}[L(\u03c0; x, t, y)] from a training sample of tuples D = {(x_{n_1}^1, t_{n_1}^1, y^1), ..., (x_{n_m}^m, t_{n_m}^m, y^m)}.\n\n2 Poisson Process Model\n\nWe assume the following data generation process for episodes e = (x, t, y) that will allow us to derive an optimization problem to be solved by the learning procedure. First, a rate parameter \u03c1, label y, and the sequence of instances x are drawn from a joint distribution p(x, \u03c1, y). Rate \u03c1 is the parameter of a Poisson process p(t | \u03c1) which now generates time sequence t. The expected loss of decision model \u03c0 is taken over all input sequences x, rate parameter \u03c1, label y, and over all possible sequences of time stamps t that can be generated according to the Poisson process.\n\nE_{x,t,y}[L(\u03c0; x, t, y)] = \u222b_x \u222b_t \u222b_\u03c1 \u2211_y L(\u03c0; x, t, y) p(t | \u03c1) p(x, \u03c1, y) d\u03c1 dt dx    (3)\n\n2.1 Derivation of Empirical Loss\n\nIn deriving the empirical counterpart of the expected loss, we want to exploit our assumption that time stamps are generated by a Poisson process with unknown but fixed rate parameter. For each input episode (x, t, y), instead of minimizing the expected loss over the single observed sequence of time stamps, we would therefore like to minimize the expected loss over all sequences of time stamps generated by a Poisson process with the rate parameter that has most likely generated the observed sequence of time stamps. Equation 4 introduces the observed sequence of time stamps t\u2032 into Equation 3 and uses the fact that the rate parameter \u03c1 is independent of x and y given t\u2032. Equation 5 rearranges the terms, and Equation 6 writes the central integral as a conditional expected value of the loss given the rate \u03c1.
Finally, Equation 7 approximates the integral over all values of \u03c1 by a single summand with value \u03c1^\u2217 for each episode.\n\nE_{x,t,y}[L(\u03c0; x, t, y)]\n= \u222b_{t\u2032} \u222b_x \u222b_t \u222b_\u03c1 \u2211_y L(\u03c0; x, t, y) p(t | \u03c1) p(\u03c1 | t\u2032) p(x, t\u2032, y) d\u03c1 dt dx dt\u2032    (4)\n= \u222b_{t\u2032} \u222b_x \u2211_y ( \u222b_\u03c1 ( \u222b_t L(\u03c0; x, t, y) p(t | \u03c1) dt ) p(\u03c1 | t\u2032) d\u03c1 ) p(x, t\u2032, y) dx dt\u2032    (5)\n= \u222b_{t\u2032} \u222b_x \u2211_y ( \u222b_\u03c1 E_t[L(\u03c0; x, t, y) | \u03c1] p(\u03c1 | t\u2032) d\u03c1 ) p(x, t\u2032, y) dx dt\u2032    (6)\n\u2248 \u222b_{t\u2032} \u222b_x \u2211_y E_t[L(\u03c0; x, t, y) | \u03c1^\u2217] p(x, t\u2032, y) dx dt\u2032    (7)\n\nWe arrive at the regularized risk functional in Equation 8 by replacing p(x, t\u2032, y) by 1/m for all observations in D and inserting the MAP estimate \u03c1_e^\u2217 as the parameter that generated time stamps t^e. The influence of the convex regularizer \u2126 is determined by regularization parameter \u03b7 > 0.\n\n\u00ca_{x,t,y}[L(\u03c0; x, t, y)] = (1/m) \u2211_{e=1}^{m} E_t[L(\u03c0; x^e, t, y^e) | \u03c1_e^\u2217] + \u03b7\u2126(\u03c0)    (8)\n\nwith \u03c1_e^\u2217 = argmax_\u03c1 p(\u03c1 | t^e).\n\nMinimizing this risk functional is the basis of the learning procedure in the next section. As noted in Section 1, for the special case when the rate-based loss \u03bb is zero, the problem reduces to a standard weighted binary classification problem and would be easy to solve with standard learning algorithms.
However, as we will see in Section 4, the \u03bb-dependent loss makes the task of learning a decision function hard to solve; attributing individual decisions with their \u201cfair share\u201d of the rate loss, and thus estimating the cost of the decision, is problematic. The Erlang learning model of Section 3 employs a decision function that allows the rate loss to be factorized naturally.\n\n3 Erlang Learning Model\n\nIn the following we derive an optimization problem that is based on modeling the policy as a data-dependent rate limit. This allows us to apply a result from queuing theory and approximate the empirical risk functional of Equation 8 with a convex upper bound. We define decision model \u03c0 in terms of the function f_\u03b8(x_i, t_i) = exp(\u03b8^T \u03d5(x_i, t_i)) which sets a limit on the admissible rate of events, where \u03d5 is some feature mapping of the initial sequence (x_i, t_i) and \u03b8 is a parameter vector. The throttling model is defined as\n\n\u03c0(x_i, t_i) = \u22121 (\u201callow\u201d) if r_\u03c0(t_i | x_i, t_i) + 1 \u2264 f_\u03b8(x_i, t_i),\n\u03c0(x_i, t_i) = +1 (\u201csuppress\u201d) otherwise.    (9)\n\nThe decision model blocks event x_i if the number of instances that were sent within [t_i \u2212 \u03c4, t_i), plus the current instance, would exceed rate limit f_\u03b8(x_i, t_i). We will now transform the optimization goal of Equation 8 into an optimization problem that can be solved by standard convex optimization tools. To this end, we first decompose the expected loss of an input sequence given the rate parameter in Equation 8 into immediate and rate-dependent loss terms.
Note that t^e denotes the observed training sequence whereas t serves as expectation variable for the expectation E_t[\u00b7 | \u03c1_e^\u2217] over all sequences conditional on the Poisson process rate parameter \u03c1_e^\u2217 as in Equation 8.\n\nE_t[L(\u03c0; x^e, t, y^e) | \u03c1_e^\u2217]\n= E_t[ \u222b_{t_1}^{t_{n_e}+\u03c4} \u03bb(y^e, r_\u03c0(t\u2032 | x^e, t)) dt\u2032 | \u03c1_e^\u2217 ] + \u2211_{i=1}^{n_e} E_t[\u2113(y^e, \u03c0(x_i^e, t_i)) | \u03c1_e^\u2217]    (10)\n= E_t[ \u222b_{t_1}^{t_{n_e}+\u03c4} \u03bb(y^e, r_\u03c0(t\u2032 | x^e, t)) dt\u2032 | \u03c1_e^\u2217 ] + \u2211_{i=1}^{n_e} E_t[\u03b4(\u03c0(x_i^e, t_i) \u2260 y^e) | \u03c1_e^\u2217] \u2113(y^e, \u2212y^e)    (11)\n\nEquation 10 uses the definition of the loss function in Equation 2. Equation 11 exploits that only decisions against the correct label, \u03c0(x_i^e, t_i) \u2260 y^e, incur a positive loss \u2113(y^e, \u03c0(x_i^e, t_i)).\n\nWe will first derive a convex approximation of the expected rate-based loss E_t[ \u222b_{t_1}^{t_{n_e}+\u03c4} \u03bb(y^e, r_\u03c0(t\u2032 | x^e, t)) dt\u2032 | \u03c1_e^\u2217 ] (the first term of Equation 11). Our definition of the decision model allows us to factorize the expected rate-based loss into contributions of individual rate limit decisions. The convexity will be addressed by Theorem 1.\n\nSince the outbound rate r_\u03c0 increases only at decision points t_i, we can upper-bound its value with the value immediately after the most recent decision in Equation 12. Equation 13 approximates the actual outbound rate with the rate limit given by f_\u03b8(x_i^e, t_i^e). This is reasonable because the outbound rate depends on the policy decisions which are defined in terms of the rate limit. Because t is generated by a Poisson process, E_t[t_{i+1} \u2212 t_i | \u03c1_e^\u2217] = 1/\u03c1_e^\u2217 (Equation 14).\n\nE_t[ \u222b_{t_1}^{t_{n_e}+\u03c4} \u03bb(y^e, r_\u03c0(t\u2032 | x^e, t)) dt\u2032 | \u03c1_e^\u2217 ]\n\u2264 \u2211_{i=1}^{n_e\u22121} E_t[t_{i+1} \u2212 t_i | \u03c1_e^\u2217] \u03bb(y^e, r_\u03c0(t_i | x^e, t)) + \u03c4 \u03bb(y^e, r_\u03c0(t_{n_e} | x^e, t))    (12)\n\u2248 \u2211_{i=1}^{n_e\u22121} E_t[t_{i+1} \u2212 t_i | \u03c1_e^\u2217] \u03bb(y^e, f_\u03b8(x_i^e, t_i^e)) + \u03c4 \u03bb(y^e, f_\u03b8(x_{n_e}^e, t_{n_e}^e))    (13)\n= \u2211_{i=1}^{n_e\u22121} (1/\u03c1_e^\u2217) \u03bb(y^e, f_\u03b8(x_i^e, t_i^e)) + \u03c4 \u03bb(y^e, f_\u03b8(x_{n_e}^e, t_{n_e}^e))    (14)\n\nWe have thus established a convex approximation of the first term of Equation 11.\n\nWe will now derive a closed form approximation of E_t[\u03b4(\u03c0(x_i^e, t_i) \u2260 y^e) | \u03c1_e^\u2217], the second part of the loss functional in Equation 11. Queuing theory provides a convex approximation: The Erlang-B formula [5] gives the probability that a queuing process which maintains a constant rate limit of f within a time interval of \u03c4 will block an event when events are generated by a Poisson process with given rate parameter \u03c1. Fortet\u2019s formula (Equation 15) generalizes the Erlang-B formula for non-integer rate limits.\n\nB(f, \u03c1\u03c4) = 1 / \u222b_0^\u221e e^{\u2212z} (1 + z/(\u03c1\u03c4))^f dz    (15)\n\nThe integral can be computed efficiently using a rapidly converging series, cf. [5]. The formula requires a constant rate limit, so that the process can reach an equilibrium. In our model, the rate limit f_\u03b8(x_i, t_i) is a function of the sequences x_i and t_i until instance x_i, and Fortet\u2019s formula therefore serves as an approximation.\n\nE_t[\u03b4(\u03c0(x_i^e, t_i) = 1) | \u03c1_e^\u2217] \u2248 B(f_\u03b8(x_i^e, t_i^e), \u03c1_e^\u2217 \u03c4)    (16)\n= [ \u222b_0^\u221e e^{\u2212z} (1 + z/(\u03c1_e^\u2217 \u03c4))^{f_\u03b8(x_i^e, t_i^e)} dz ]^{\u22121}    (17)\n\nUnfortunately, Equation 17 is not convex in \u03b8. We approximate it with the convex upper bound \u2212log(1 \u2212 B(f_\u03b8(x_i^e, t_i^e), \u03c1_e^\u2217 \u03c4)) (cf. the dashed green line in the left panel of Figure 2(b) for an illustration). This is an upper bound, because \u2212log p \u2265 1 \u2212 p for 0 \u2264 p \u2264 1; its convexity is addressed by Theorem 1. Likewise, E_t[\u03b4(\u03c0(x_i^e, t_i) = \u22121) | \u03c1_e^\u2217] is approximated by the upper bound \u2212log(B(f_\u03b8(x_i^e, t_i^e), \u03c1_e^\u2217 \u03c4)). We have thus derived a convex upper bound of E_t[\u03b4(\u03c0(x_i^e, t_i) \u2260 y^e) | \u03c1_e^\u2217].\n\nCombining the two components of the optimization goal (Equation 11) and adding convex regularizer \u2126(\u03b8) and regularization parameter \u03b7 > 0 (Equation 8), we arrive at an optimization problem for finding the optimal policy parameters \u03b8.\n\nOptimization Problem 1 (Erlang Learning Model).
Over \u03b8, minimize\n\nR(\u03b8) = (1/m) \u2211_{e=1}^{m} { \u2211_{i=1}^{n_e\u22121} (1/\u03c1_e^\u2217) \u03bb(y^e, f_\u03b8(x_i^e, t_i^e)) + \u03c4 \u03bb(y^e, f_\u03b8(x_{n_e}^e, t_{n_e}^e)) + \u2211_{i=1}^{n_e} \u2212log[\u03b4(y^e = 1) \u2212 y^e B(f_\u03b8(x_i^e, t_i^e), \u03c1_e^\u2217 \u03c4)] \u2113(y^e, \u2212y^e) } + \u03b7\u2126(\u03b8)    (18)\n\nNext we show that minimizing risk functional R amounts to solving a convex optimization problem.\n\nTheorem 1 (Convexity of R). R(\u03b8) is a convex risk functional in \u03b8 for any \u03c1_e^\u2217 > 0 and \u03c4 > 0.\n\nProof. The convexity of \u03bb and \u2126 follows from their definitions. It remains to be shown that both \u2212log(B(f_\u03b8(\u00b7), \u03c1_e^\u2217 \u03c4)) and \u2212log(1 \u2212 B(f_\u03b8(\u00b7), \u03c1_e^\u2217 \u03c4)) are convex in \u03b8. Component \u2113(y^e, \u2212y^e) of Equation 18 is independent of \u03b8. It is known that Fortet\u2019s formula B(f, \u03c1_e^\u2217 \u03c4) is convex, monotonically decreasing, and positive in f for \u03c1_e^\u2217 \u03c4 > 0 [5]. Furthermore, \u2212log(B(f, \u03c1_e^\u2217 \u03c4)) is convex and monotonically increasing. Since f_\u03b8(\u00b7) is convex in \u03b8, it follows that \u2212log(B(f_\u03b8(\u00b7), \u03c1_e^\u2217 \u03c4)) is also convex. Next, we show that \u2212log(1 \u2212 B(f_\u03b8(\u00b7), \u03c1_e^\u2217 \u03c4)) is convex and monotonically decreasing. From the above it follows that b(f) = 1 \u2212 B(f, \u03c1_e^\u2217 \u03c4) is monotonically increasing, concave and positive. Therefore, d^2/df^2 (\u2212ln b(f)) = (b\u2032(f))^2 / b(f)^2 \u2212 b\u2033(f) / b(f) \u2265 0, as both summands are nonnegative. Again, it follows that \u2212log(1 \u2212 B(f_\u03b8(\u00b7), \u03c1_e^\u2217 \u03c4)) is convex in \u03b8 due to the definition of f_\u03b8.\n\n4 Prior Work and Reference Methods\n\nWe will now discuss how the problem of minimizing the expected loss, \u03c0^\u2217 = argmin_\u03c0 E_{x,t,y}[L(\u03c0; x, t, y)], from a sample of sequences x of events with labels y and observed rate parameters \u03c1^\u2217 relates to previously studied methods. Sequential decision-making problems are commonly solved by reinforcement learning approaches, which have to attribute the loss of an episode (Equation 2) to individual decisions in order to learn to decide optimally in each state. Thus, a crucial part of defining an appropriate procedure for learning the optimal policy consists in defining an appropriate state-action loss function. Q^\u03c0(s, a) estimates the loss of performing action a in state s when following policy \u03c0 for the rest of the episode.\n\nSeveral different state-action loss functions for related problems have been investigated in the literature. For example, policy gradient methods such as in [4] assign the loss of an episode to individual decisions proportional to the log-probabilities of the decisions. Other approaches use sampled estimates of the rest of the episode Q(s_i, a_i) = L(\u03c0, s) \u2212 L(\u03c0, s_i) or the expected loss if a distribution of states of the episode is known [7]. Such general purpose methods, however, are not the optimal choice for the particular problem instance at hand. Consider the special case \u03bb = 0, where the problem reduces to a sequence of independent binary decisions.
Assigning the cumulative loss of the episode to all instances leads to a grave distortion of the optimization criterion.\n\nAs reference in our experiments we use a state-action loss function that assigns the immediate loss \u2113(y, a_i) to state s_i only. Decision a_i determines the loss incurred by \u03bb only for \u03c4 time units, in the interval [t_i, t_i + \u03c4). The corresponding rate loss is \u222b_{t_i}^{t_i+\u03c4} \u03bb(y, r_\u03c0(t\u2032 | x, t)) dt\u2032. Thus, the loss of deciding a_i = \u22121 instead of a_i = +1 is the difference in the corresponding \u03bb-induced loss. Let x_{\u2212i}, t_{\u2212i} denote the sequence x, t without instance x_i. This leads to a state-action loss function that is the sum of immediate loss and \u03bb-induced loss; it serves as our first baseline.\n\nQ_it^\u03c0(s_i, a) = \u2113(y, a) + \u03b4(a = \u22121) \u222b_{t_i}^{t_i+\u03c4} [ \u03bb(y, r_\u03c0(t\u2032 | x_{\u2212i}, t_{\u2212i}) + 1) \u2212 \u03bb(y, r_\u03c0(t\u2032 | x_{\u2212i}, t_{\u2212i})) ] dt\u2032    (19)\n\nBy approximating \u222b_{t_i}^{t_i+\u03c4} \u03bb(y, r_\u03c0(t\u2032 | x, t)) dt\u2032 with \u03c4 \u03bb(y, r_\u03c0(t_i | x, t)), we define a second plausible state-action loss function that, instead of using the observed loss to estimate the loss of an action, approximates it with the loss that would be incurred by the current outbound rate r_\u03c0(t_i | x_{\u2212i}, t_{\u2212i}) for \u03c4 time units.\n\nQ_ub^\u03c0(s_i, a) = \u2113(y, a) + \u03b4(a = \u22121) [ \u03bb(y, r_\u03c0(t_i | x_{\u2212i}, t_{\u2212i}) + 1) \u2212 \u03bb(y, r_\u03c0(t_i | x_{\u2212i}, t_{\u2212i})) ] \u03c4    (20)\n\nThe state variable s has to encode all information a policy needs to decide. Since the loss crucially depends on outbound rate r_\u03c0(t\u2032 | x, t), any throttling model must have access to the current outbound rate.
The transition between a current and a subsequent rate depends on the time at which the next event occurs, but also on the entire backlog of events, because past events may drop out of the interval \u03c4 at any time. In analogy to the information that is available to the Erlang learning model, it is natural to encode states s_i as a vector of features \u03d5(x_i, t_i) (see Section 5 for details) together with the current outbound rate r_\u03c0(t_i | x, t). Given a representation of the state and a state-action loss function, different approaches for defining the policy \u03c0 and optimizing its parameters have been investigated. For our baselines, we use the following two methods.\n\nPolicy gradient. Policy gradient methods model a stochastic policy directly as a parameterized decision function. They perform a gradient descent that always converges to a local optimum [8]. The gradient of the expected loss with respect to the parameters is estimated in each iteration k for the distribution over episodes, states, and losses that the current policy \u03c0_k induces. However, in order to achieve fast convergence to the optimal policy, one would need to determine the gradient for the distribution over episodes, states, and losses induced by the optimal policy. We implement two policy gradient algorithms for experimentation which only differ in using Q_it and Q_ub, respectively. They are denoted PGit and PGub in the experiments. Both use a logistic regression function as decision function, the two-class equivalent of the Gibbs distribution which is used in the literature.\n\nIterative Classifier. The second approach is to represent policies as classifiers and to employ methods for supervised classification learning. A variety of papers addresses this approach [6, 3, 7]. We use an algorithm that is inspired by [1, 2] and is adapted to the problem setting at hand.
Blatt and Hero [2] investigate an algorithm that finds non-stationary policies for two-action T-step MDPs by solving a sequence of one-step decisions via a binary classifier. Classifiers \u03c0_t for time step t are learned iteratively on the distribution of states generated by the policy (\u03c0_0, ..., \u03c0_{t\u22121}). Our derived algorithm iteratively learns a weighted support vector machine (SVM) classifier \u03c0_{k+1} in iteration k+1 on the set of instances and losses Q^{\u03c0_k}(s, a) that were observed after classifier \u03c0_k was used as policy on the training sample. The weight vector of \u03c0_k is denoted \u03b8_k. The weight of misclassification of s is given by Q^{\u03c0_k}(s, \u2212y). The SVM weight vector is altered in each iteration as \u03b8_{k+1} = (1 \u2212 \u03b1_k) \u03b8_k + \u03b1_k \u03b8\u0302, where \u03b8\u0302 is the weight vector of the new classifier that was learned on the observed losses. In the experiments, two iterative SVM learners were implemented, denoted It-SVMit and It-SVMub, corresponding to the used state-action losses Q_it and Q_ub, respectively. Note that for the special case \u03bb = 0 the iterative SVM algorithm reduces to a standard SVM algorithm.\n\nAll four procedures iteratively estimate the loss of a policy decision on the data via a state-action loss function and learn a new policy \u03c0 based on this estimated cost of the decisions. Convergence guarantees typically require the Markov assumption; that is, the process is required to possess a stationary transition distribution P(s_{i+1} | s_i, a_i). Since the transition distribution in fact depends on the entire backlog of time stamps and the duration over which state s_i has been maintained, the Markov assumption is violated to some extent in practice. In addition to that, \u03bb-based loss estimates are sampled from a Poisson process. In each iteration \u03c0 is learned to minimize sampled and inherently random losses of decisions.
Thus, convergence to a robust solution becomes unlikely.\n\nIn contrast, the Erlang learning model directly minimizes the \u03bb-loss by assigning a rate limit. The rate limit implies an expectation of decisions. In other words, the \u03bb-based loss is minimized without explicitly estimating the loss of any decisions that are implied by the rate limit. The convexity of the risk functional in Optimization Problem 1 guarantees convergence to the global optimum.\n\n5 Application\n\nThe goal of our experiments is to study the relative benefits of the Erlang learning model and the four reference methods over a number of loss functions. The subject of our experimentation is the problem of suppressing spam and phishing messages sent from abusive accounts registered at a large email service provider. We sample approximately 1,000,000 emails sent from approximately 10,000 randomly selected accounts over two days and label them automatically based on information passed by other email service providers via feedback loops (in most cases triggered by \u201creport spam\u201d buttons). Because of this automatic labeling process, the labels contain a certain amount of noise.\n\nFigure 1: Average loss on test data depending on the influence of the rate loss c_\u03bb for different immediate loss constants c_\u2212 and c_+.\n\nFeature mapping \u03d5 determines a vector of moving average and moving variance estimates of several attributes of the email stream. These attributes measure the frequency of subject changes and sender address changes, and the number of recipients. Other attributes indicate whether the subject line or the sender address have been observed before within a window of time. Additionally, a moving average estimate of the rate \u03c1 is used as feature.
Finally, other attributes quantify the size of the message and the score returned by a content-based spam filter employed by the email service.\n\nWe implemented the baseline methods that were described in Section 4, namely the iterative SVM methods It-SVMub and It-SVMit and the policy gradient methods PGub and PGit. Additionally, we used a standard support vector machine classifier SVM with weights of misclassification corresponding to the costs defined in Equation 1. The Erlang learning model is denoted ELM in the plots. Linear decision functions were used for all baselines.\n\nIn our experiments, we assume a cost that is quadratic in the outbound rate. That is, \u03bb(1, r_\u03c0(t\u2032 | x, t)) = c_\u03bb \u00b7 r_\u03c0(t\u2032 | x, t)^2 with c_\u03bb > 0 determining the influence of the rate loss on the overall loss. The time interval \u03c4 was chosen to be 100 seconds. Regularizer \u2126(\u03b8) as in Optimization Problem 1 is the commonly used squared l2-norm \u2126(\u03b8) = \u2225\u03b8\u2225_2^2.\n\nWe evaluated our method for different costs of incorrectly classified non-spam emails (c_\u2212), incorrectly classified spam emails (c_+) (see the definition of \u2113 in Equation 1), and rate of outbound spam messages (c_\u03bb). For each setting, we repeated 100 runs; each run used about 50% of the data, chosen at random, as training data and the remaining part as test data. Splits were chosen such that there were equally many spam episodes in training and test set. We tuned the regularization parameter \u03b7 for the Erlang learning model as well as the corresponding regularization parameters of the iterative SVM methods and the standard SVM on a separate tuning set that was split randomly from the training data.\n\n5.1 Results\n\nFigure 1 shows the resulting average loss of the Erlang learning model and reference methods. Each of the three plots shows loss versus parameter c_\u03bb which determines the influence of the rate loss on the overall loss. The left plot shows the loss for (c_\u2212 = 5, c_+ = 1), the center plot for (c_\u2212 = 10, c_+ = 1), and the right plot for (c_\u2212 = 20, c_+ = 1).\n\nWe can see in Figure 1 that the Erlang learning model outperforms all baseline methods for larger values of c_\u03bb (more influence of the rate-dependent loss on the overall loss) in two of the three settings. For c_\u2212 = 20 and c_+ = 1 (right panel), the performance is comparable to the best baseline method It-SVMub; only for the largest shown c_\u03bb = 5 does the ELM outperform this baseline. The iterative classifier It-SVMub that uses the approximated state-action loss Q_ub performs uniformly better than It-SVMit, the iterative SVM method that uses the sampled loss from the previous iteration. It-SVMit itself surprisingly shows very similar performance to that of the standard SVM method; only for the setting c_\u2212 = 20 and c_+ = 1 in the right panel does this iterative SVM method show superior performance.
Both policy gradient methods perform comparably to the Erlang learning model for smaller values of c_\u03bb but deteriorate for larger values.\n\n[Plot panels of Figure 1: loss (0 to 9) against c_\u03bb (0 to 5) for (c_\u2212 = 5, c_+ = 1), (c_\u2212 = 10, c_+ = 1), and (c_\u2212 = 20, c_+ = 1); methods shown: ELM, It-SVMit, It-SVMub, PGub, PGit, SVM.]\n\nFigure 2(a): Average loss and standard error for small values of c_\u03bb.\n\nFigure 2(b): Left: Fortet\u2019s formula B(exp(\u03d5\u03b8), \u03c1\u03c4) (Equation 17) and its upper bound \u2212log(1 \u2212 B(exp(\u03d5\u03b8), \u03c1\u03c4)) for \u03c1\u03c4 = 10. Right: 1 \u2212 B(exp(\u03d5\u03b8), \u03c1\u03c4) and respective upper bound \u2212log(B(exp(\u03d5\u03b8), \u03c1\u03c4)).\n\nAs expected, the iterative SVM and the standard SVM algorithms perform better than the Erlang learning model and policy gradient models if the influence of the rate-dependent loss is very small. This can best be seen in Figure 2(a). It shows a detail of the results for the setting c_\u2212 = 5 and c_+ = 1, for c_\u03bb ranging only from 0 to 1. This is the expected outcome following the considerations in Section 4. If c_\u03bb is close to 0, the problem approximately reduces to a standard binary classification problem, thus favoring the very good classification performance of support vector machines. However, for larger c_\u03bb the influence of the rate-dependent loss rises and increasingly dominates the immediate classification loss \u2113. Consequently, for those cases (which are the important ones in this real-world application) the better rate loss estimation of the Erlang learning model compared to the baselines leads to better performance.\n\nThe average training times for the Erlang learning model and the reference methods are in the same order of magnitude. The SVM algorithm took 14 minutes on average to converge to a solution.
The Erlang learning model converged after 44 minutes and the policy gradient methods took approximately 45 minutes. The training times of the iterative classifier methods were about 60 minutes.\n\n6 Conclusion\n\nWe devised a model for sequential decision-making problems in which events are generated by a Poisson process and the loss may depend on the rate of decision outcomes. Using a throttling policy that enforces a data-dependent rate limit, we were able to factor the loss over single events. Applying a result from queuing theory led us to a closed-form approximation of the immediate event-specific loss under a rate limit set by a policy. Both parts led to a closed-form convex optimization problem.\nOur experiments explored the learning model for the problem of suppressing abuse of an email service. We observed significant improvements over iterative reinforcement learning baselines. The model is being employed to this end in the email service provided by web hosting firm STRATO. It has replaced a procedure of manual deactivation of accounts after inspection triggered by spam reports.\n\nAcknowledgments\n\nWe gratefully acknowledge support from STRATO Rechenzentrum AG and the German Science Foundation DFG.\n\nReferences\n\n[1] J.A. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. Advances in Neural Information Processing Systems, 16, 2004.\n\n[2] D. Blatt and A.O. Hero. From weighted classification to policy search. Advances in Neural Information Processing Systems, 18, 2006.\n\n[3] C. Dimitrakakis and M.G. Lagoudakis. Rollout sampling approximate policy iteration. 
Machine Learning, 72(3):157\u2013171, 2008.\n\n[4] M. Ghavamzadeh and Y. Engel. Bayesian policy gradient algorithms. Advances in Neural Information Processing Systems, 19, 2007.\n\n[5] D.L. Jagerman, B. Melamed, and W. Willinger. Stochastic modeling of traffic processes. Frontiers in queueing: models, methods and problems, pages 271\u2013370, 1996.\n\n[6] M.G. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning, 2003.\n\n[7] J. Langford and B. Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 2005.\n\n[8] R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 2000.\n", "award": [], "sourceid": 911, "authors": [{"given_name": "Uwe", "family_name": "Dick", "institution": null}, {"given_name": "Peter", "family_name": "Haider", "institution": null}, {"given_name": "Thomas", "family_name": "Vanck", "institution": null}, {"given_name": "Michael", "family_name": "Br\u00fcckner", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}