p*}. Since with the normal method d(φ, s, T) = -1 if s/T > p*, PB = P{d(φ, s, T) = -1}. From (1), with uniform weighting the decision boundary is where PB = 0.5. If the samples are small (i.e. T < (ln 2)/p* < 1/p*), then d(φ, s, T) = -1 for all s > 0. In this case PB = 1 - (1 - p(φ))^T. Solving for p(φ) at PB = 0.5 using ln(1 - x) ≈ -x, the decision boundary is at p(φ) ≈ (ln 2)/T > p*. So, for small sample sizes, the normal method boundary is biased to greater than p* and can be made orders of magnitude larger as T becomes smaller. For larger T, e.g. Tp* > 10, this bias will be seen to be negligible.

One obvious solution is to have large samples. This is complicated by three effects. The first is that desired loss rates in data systems are often small, typically in the range 10^-6 to 10^-12. This implies that to be large, samples must contain at least 10^7 to 10^13 packets. For the latter, even at Gbps rates, short packets, and full loading, this translates into samples of several hours of traffic. Even for the former at typical rates, this can translate into minutes of traffic. The second, related problem is that in dynamic data networks, while individual connections may last for significant periods, in the aggregate a given combination of loads may not exist for the requisite period. The third, more subtle problem is that in any queueing system, even with uncorrelated arrival traffic, the buffering introduces memory into the system. A typical sample with losses may contain 100 losses, but a loss trace would show that all of the losses occurred in a single short overload interval. Thus the number of independent trials can be several orders of magnitude smaller than indicated by the raw sample size, implying that the loads must be stable for hours, days, or even years to get samples that lead to unbiased classification.

An alternative approach used in [Hir95] sets d(φ, s, T) = s/T and models p(φ) directly.
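The small-sample bias can be checked numerically. The following is a sketch of our own (the function name is an assumption), solving 1 - (1 - p)^T = 0.5 exactly rather than through the ln(1 - x) ≈ -x approximation; it applies only in the small-sample regime T < (ln 2)/p*, where any loss forces d(φ, s, T) = -1:

```python
import math

# In the small-sample regime T < (ln 2)/p*, any loss forces d(phi, s, T) = -1,
# so P_B = 1 - (1 - p)^T and the boundary solves 1 - (1 - p)^T = 0.5 exactly.

def normal_method_boundary(T):
    """Loss rate p at which P_B = 0.5 for a sample of T trials."""
    return 1.0 - 0.5 ** (1.0 / T)        # ~ (ln 2)/T for small p

p_star = 1e-6                            # here (ln 2)/p* is about 6.9e5 trials
for T in (1e4, 1e5, 5e5):
    b = normal_method_boundary(T)
    print(f"T = {T:7.0f}   boundary = {b:.2e}   boundary/p* = {b / p_star:5.1f}")
```

The boundary/p* column shows the bias shrinking toward one as T grows, but at T = 10^4 the boundary sits near 7×10^-5, almost seventy times the target loss rate.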
The probabilities can vary over orders of magnitude, making accurate estimates difficult. Estimating the less variable log(p(φ)) with d = log(s/T) is complicated by the logarithm being undefined for small samples, where most samples have no losses so that s = 0.

4 METHODS FOR TREATING BIAS AND VARIANCE

We present without proof two preprocessing methods derived and analyzed in [Bro96]. The first eliminates the sample bias by choosing an appropriate d and w that directly solve (1) s.t. c(φ) >, <, = 0 if and only if p(φ) <, >, = p*, i.e. it is an unbiased estimate as to whether the loss rate is above or below p*. This is the weighting method shown in Table 1. The relative weighting of samples with loss rates above and below the critical loss rate is plotted in Figure 1. For large T, as expected, it reduces to the normal method.

The second preprocessing method assigns uniform weighting, but classifies d(φ, s, T) = 1 only if a certain confidence level, L, is met that the sample represents a combination where p(φ) < p*. Such a confidence was derived in [Bro96]:

    P{p(φ) > p* | s, T} = e^(-Tp*) Σ_{i=0}^{s} (Tp*)^i / i!                    (2)

Table 1: Summary of Methods (for each method, the sample class d(φ, s, T) and the weighting w(φ_j, s_j, T_j)).

For small T (e.g. T < 1/p* and L > 1 - 1/e), even if s = 0 (no losses), this level is not met. But a neighborhood of samples with similar load combinations may all have no losses, indicating that this sample can be classified as having p(φ) < p*. Choosing a neighborhood requires a metric, m, between feature vectors, φ. In this paper we simply use Euclidean distance. Using the above and solving for T when s = 0, the smallest meaningful neighborhood size is the smallest k such that the aggregate sample is greater than a critical size, T* = -ln(1 - L)/p*.
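To make the roles of (2) and T* concrete, here is a hedged sketch of ours (the function names are assumptions, and the Poisson-tail form is our reading of (2), not code from [Bro96]). A lossless sample needs at least T* trials before the level is met, which is exactly the neighborhood-growing rule:

```python
import math

def confidence_loss_rate_high(s, T, p_star):
    """Eq. (2): P{p > p* | s, T} = exp(-T p*) * sum_{i=0}^{s} (T p*)^i / i!"""
    m = T * p_star
    term, total = math.exp(-m), 0.0
    for i in range(s + 1):
        total += term          # Poisson term i of the tail sum
        term *= m / (i + 1)    # advance to m^(i+1)/(i+1)!
    return total

def smallest_k(neighbor_T, T_star):
    """Smallest k s.t. the k+1 nearest samples aggregate to at least T* trials."""
    total = 0.0
    for k, T_k in enumerate(neighbor_T):    # neighbor_T sorted by distance
        total += T_k
        if total >= T_star:
            return k
    return None                             # neighborhood too small

p_star, L = 1e-6, 0.95
T_star = -math.log(1 - L) / p_star          # critical size, about 3.0e6 trials
# A lossless sample of exactly T* trials just meets the level: exp(-T* p*) = 1 - L.
print(confidence_loss_rate_high(0, T_star, p_star))   # ~0.05 = 1 - L
print(smallest_k([1e6, 1e6, 1e6, 1e6], T_star))       # 2: three samples suffice
```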
From (2), this guarantees that if no packets in the aggregate sample are lost, we can classify it as having p(φ) < p* within our confidence level. For larger samples, or where samples are more plentiful and k can afford to be large, (2) can be used directly. Table 2 summarizes this aggregate method.

The above preprocessing methods assume that the training samples consist of independent samples of Bernoulli trials. Because of memory introduced by the buffer and possible correlations in the arrivals, this is decidedly not true. The methods can still be applied if samples can be subsampled at every Ith trial, where I is large enough that the samples are pseudo-independent, i.e. the dependency is not significant for our application.

A simple graphical method for determining I is as follows. Observing Figure 1, if the number of trials is artificially increased, for small samples the weighting method will tend to underweight the trials with errors, so that its decision boundary will be at erroneously high loss rates. This is the case with correlated samples: the sample size, T, overstates the number of independent trials. As the subsample factor is increased, the subsample size becomes smaller, the trials become increasingly independent, the weighting becomes more appropriate, and the decision boundary moves closer to the true decision boundary. At some point, the samples are sufficiently independent that sparser subsampling does not change the decision boundary. By plotting the decision boundary of the classifier as a function of I, the point where the boundary becomes independent of the subsample factor indicates a suitable choice for I.

In summary, the procedure consists of collecting traffic samples at different combinations of traffic loads that do and do not meet quality of service. These are then subsampled with a factor I determined as above.
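The effect of subsampling can be illustrated with a small self-contained sketch of our own (the bursty synthetic trace is an assumption, not the paper's data): losses arrive in correlated bursts, so thinning the trace by a factor I shrinks T toward the number of pseudo-independent trials while leaving the empirical loss rate s/T roughly unchanged:

```python
import random

# Our own toy trace (not the paper's data): losses in a buffered queue come in
# correlated bursts, so the raw trial count T overstates the number of
# independent trials.  Subsampling every Ith slot thins the trace toward
# pseudo-independence while s/T stays roughly constant.

random.seed(0)
T_total, burst_len = 200_000, 50
trace = []
while len(trace) < T_total:
    if random.random() < 0.0005:          # an overload interval begins
        trace.extend([1] * burst_len)     # a correlated run of losses
    else:
        trace.append(0)                   # a normal, loss-free slot
trace = trace[:T_total]

for I in (1, 10, 50, 100):
    sub = trace[::I]                      # keep every Ith trial
    s, n = sum(sub), len(sub)
    print(f"I = {I:3d}   T = {n:6d}   s = {s:5d}   s/T = {s / n:.2e}")
```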
Then one of the sample preprocessing methods summarized in Table 1 is applied to the data. These preprocessed samples are then used in any neural network or classification scheme. Analysis in [Bro96] derives the expected bias (shown in Figure 2) of the methods when used with an ideal classifier. The normal method can be arbitrarily biased, the weighting method is unbiased, and the aggregate method chooses a conservative boundary. Simulation experiments in [Bro96] applying it to a well-characterized M/M/1 queueing system to determine acceptable loads showed that the weighting method was able to produce unbiased threshold estimates over a range of values; and the aggregate method produced conservative estimates that were always below the desired threshold, although in terms of traffic load were only 5% smaller. Even in this simple system, where the input traffic is uncorrelated (but the losses become correlated due to the memory in the queue), the subsample factor was 12, meaning that good results required more than 90% of the data be thrown out.

Table 2: Aggregate Classification Algorithm

1. Given sample (φ_i, s_i, T_i) ∈ {(φ_j, s_j, T_j)}, metric, m, and confidence level, L.
2. Calculate T* = -ln(1 - L)/p*.
3. Find nearest neighbors n_0, n_1, ..., where n_0 = i and m(φ_{n_j}, φ_i) ≤ m(φ_{n_{j+1}}, φ_i) for j ≥ 0.
4. Choose smallest k s.t. T' = Σ_{j=0}^{k} T_{n_j} ≥ T*. Let s' = Σ_{j=0}^{k} s_{n_j}.
5. Using (2), d(φ_i, s_i, T_i) = +1 if P{p(φ) > p* | s', T'} < (1 - L), and -1 otherwise.

T. X. Brown

Figure 1: Plot of Relative Weighting of Samples with Losses Below (w-) and Above (w+) the Critical Loss Rate.

Figure 2: Expected Decision Boundary Normalized by p*. The nominal boundary is p/p* = 1. The aggregate method uses L = 0.95.

5 EXPERIMENTS WITH ETHERNET TRAFFIC DATA

This paper set out to solve the problem of access control for real-world data. We consider a system where the call combinations consist of individual computer data users trunked onto a single output link. This is modeled as a discrete-time single-server queueing model where in each time slot one packet can be processed and zero or more packets can arrive from the different users. The server has a buffer of fixed length 1000. To generate a realistic arrival process, we use ethernet data traces. The bandwidth of the link was chosen from 10 to 100Mbps. With 48-byte packets, the queue packet service rate was the bandwidth divided by 384. All arrival rates are normalized by the service rate.

5.1 THE DATA

We used ethernet data described in [Lel93] as the August 89 busy hour, containing traffic ranging from busy file-servers/routers to users with just a handful of packets. The detailed data set records every packet's arrival time (to the nearest 100 μsec), size, plus source and destination tags. From this, 108 "data traffic" sources were generated, one for each computer that generated traffic on the ethernet link. To produce uniform-size packets, each ethernet packet (which ranged from 64 to 1518 bytes) was split into 2 to 32 48-byte packets (partial packets were padded to 48 bytes). Each ethernet packet arrival time was mapped into a particular time slot in the queueing model.
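The trace preparation just described can be sketched as follows (a minimal illustration of our own; the helper names, and the exact time-to-slot mapping, are assumptions rather than the authors' code):

```python
import math

def split_into_cells(frame_bytes, cell_bytes=48):
    """Split one ethernet frame into fixed 48-byte packets (last one padded)."""
    return math.ceil(frame_bytes / cell_bytes)   # 64 -> 2, 1518 -> 32

def to_slot(arrival_sec, bandwidth_bps):
    """Map an arrival time to a timeslot: one 384-bit packet served per slot."""
    slot_sec = 384.0 / bandwidth_bps             # service rate = bandwidth/384
    return int(arrival_sec / slot_sec)

print(split_into_cells(64), split_into_cells(1518))   # 2 32
print(to_slot(1.0e-3, 10e6))                          # 26 on a 10Mbps link
```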
All the packets arriving in a timeslot are immediately added to the buffer, any buffer overflows are discarded (counted as lost), and if the buffer was non-empty at the start of the timeslot, one packet is sent. Ethernet contains a collision protocol so that only one of the sources is sending packets at any one time onto a 10Mbps connection. Decorrelating the sources via random starting offsets produced independent data sources with the potential for overloads. Multiple copies at different offsets produced sufficient loads even for bandwidths greater than 10Mbps.

The peak data rate with this data is fixed, while the load (the average rate over the one-hour trace normalized by the peak rate) ranges over five orders of magnitude. Also troubling, analysis of this data [Lel93] shows that the aggregate traffic exhibits chaotic self-similar properties and suggests that it may be due to the sources' distribution of packet interarrival times following an extremely heavy-tailed distribution with infinite higher-order moments. No tractable closed-form solution exists for such data to predict whether a particular load will result in an overload. Thus, we apply adaptive access control.

Adaptive Access Control Applied to Ethernet Data

5.2 EXPERIMENT AND RESULTS

We divided the data into two roughly similar groups of 54 sources each: one for training and one for testing. To create sample combinations we assign a distribution over the different training sources, choose a source combination from this distribution, and choose a random, uniform (over the period of the trace) starting time for each source. Simulations that reach the end of a trace wrap around to the beginning of the trace. The sources are described by a single feature corresponding to the average load of the source over the one-hour data trace. A group of sources is described by the sum of the average loads.
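The buffer dynamics described above can be sketched with a few lines of our own code (a minimal illustration; a real run would feed it per-slot arrival counts built from the traces):

```python
def run_queue(arrivals_per_slot, buffer_size=1000):
    """Discrete-time single-server queue; returns (packets served, packets lost)."""
    backlog = served = lost = 0
    for arrivals in arrivals_per_slot:
        can_serve = backlog > 0            # non-empty at the start of the slot
        accepted = min(arrivals, buffer_size - backlog)
        lost += arrivals - accepted        # buffer overflow is counted as lost
        backlog += accepted
        if can_serve:                      # one packet sent per timeslot
            backlog -= 1
            served += 1
    return served, lost

print(run_queue([5, 0, 0, 0, 0, 0], buffer_size=3))   # burst of 5 into a 3-packet buffer: (3, 2)
```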
The source distribution was a uniformly chosen 0 to M copies of each of the 54 training samples. M was dynamically chosen so that the link would be sufficiently loaded to cause losses. Each sample combination was processed for 3×10^7 time slots, recording the load combination, the number of packets serviced correctly, and the number blocked. The experiment was repeated for a range of bandwidths. The bandwidths and number of samples at each bandwidth are shown in Table 3.

We applied the three methods of Table 1 based on p* = 10^-6 (L = 95% for the aggregate method) and used the resulting data in a linear classifier. Since the feature is the load and larger loads will always cause more blocking, p(*