size by doing iterative variance reduction until the estimates lie within an acceptable
confidence interval. Third, it combines the estimates of all periodicities.
14.4.2 Considering Multiple Size Periodicities
Considering the periodicity of IP activity is imperative. The periodicities of IP sizes were discovered by selecting a sample of IPs and applying a discrete Fourier transform to each. The terms with the highest coefficients correspond to the periodicities used by the PredictSizes algorithm [2]. The vast majority of IPs have diurnal and weekly periodicities. These periodicities are especially clear for the IPs of school districts and large institutes.
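As an illustrative sketch of this discovery step (not taken from [2]; the six-hour slot length and the top_k parameter are assumptions), the dominant periodicities of one IP's size series can be read off the strongest DFT terms:

```python
import numpy as np

def dominant_periodicities(sizes, top_k=3):
    """Return the periods (in slots) of the top_k strongest DFT terms.

    sizes holds one activity-size value per time slot (e.g., one six-hour
    slot).  The zero-frequency term is ignored because it carries no
    periodicity information.
    """
    sizes = np.asarray(sizes, dtype=float)
    coeffs = np.fft.rfft(sizes - sizes.mean())      # real-input DFT
    freqs = np.fft.rfftfreq(len(sizes), d=1.0)      # cycles per slot
    magnitudes = np.abs(coeffs)
    magnitudes[0] = 0.0                              # drop the DC term
    order = np.argsort(magnitudes)[::-1]
    strongest = [i for i in order[:top_k] if freqs[i] > 0]
    return [1.0 / freqs[i] for i in strongest]       # periods in slots
```

With six-hour slots, a diurnal periodicity would appear as a period of 4 slots and a weekly periodicity as a period of 28 slots.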
PredictSizes fetches the estimates of several periodicities, for example, diurnal and weekly, for each IP to produce its prediction. For n periodicities, s_1 < s_2 < ⋯ < s_n, PredictSizes considers the most recent w_i estimates spaced s_i periods apart, for 1 ≤ i ≤ n. For example, to estimate the sizes of IPs in six hours with all the sliding windows having length 10, s_1 = 1, s_2 = 4, and s_3 = 28 (in six-hour slots), PredictSizes considers the last 10 contiguous six-hour estimates, as well as the same-slot estimates of the last 10 days and the last 10 weeks.
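A minimal sketch of this window selection, assuming the estimates are kept in a flat list ordered oldest first (the function name and representation are illustrative, not from [2]):

```python
def windows_per_periodicity(estimates, periods, window_lengths):
    """Collect, for each periodicity, the most recent same-slot estimates.

    estimates      : per-slot size estimates, oldest first
    periods        : e.g., [1, 4, 28] in six-hour slots
    window_lengths : e.g., [10, 10, 10]
    """
    series = {}
    last = len(estimates) - 1
    for s, w in zip(periods, window_lengths):
        idx = range(last, last - s * w, -s)          # step back s slots at a time
        series[s] = [estimates[i] for i in idx if i >= 0]
    return series

# With periods [1, 4, 28] and windows of length 10, series[1] holds the last
# 10 contiguous six-hour estimates, series[4] the same six-hour slot of the
# last 10 days, and series[28] that of the last 10 weeks.
```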
14.4.3 Iterative Variance Reduction
PredictSizes handles the size time series of each periodicity of each IP in isolation. It then combines the predictions from all the periodicities of an IP, as discussed in Section 14.4.4.
For time series prediction, it is typical to perform trend analysis using simple linear regression to detect a consistent increase or decrease over time (allowing for some white noise) [10]. The trend is then used for extrapolation. However, based on the analysis of numerous IPs, the time series of size periodicities almost never show strong trends within the window of estimates used for prediction. Moreover, trend analysis hurts IPs whose sizes change drastically, since false trends result in erroneous predictions. Hence, PredictSizes assumes a stable value for each time series. The stable value, the representative statistic of the time series, is calculated using the StableSize function and is produced as the prediction.
For simplicity, StableSize deals with each periodicity time series as a set. For each time series, StableSize performs iterative variance reduction by removing the outliers that contribute the most to the variance until the ratio of the width of the confidence interval to the mean falls below a given bound. The truncated mean of the remaining sizes is declared the stable size.
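A minimal sketch of this loop under stated assumptions (the 95% z-value and the max_ratio bound are illustrative; the original uses a configurable c-confidence interval, and returning None here stands in for the failure case mentioned below). Statistics are recomputed from scratch in this sketch; the constant-time updates are described next.

```python
import statistics

def stable_size(series, max_ratio=0.1, z=1.96):
    """Drop the extreme element until the confidence-interval width divided
    by the mean is within max_ratio, then return the truncated mean."""
    values = sorted(series)
    while len(values) > 2:
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values)
        ci_width = 2 * z * stdev / len(values) ** 0.5
        if mean > 0 and ci_width / mean <= max_ratio:
            return statistics.fmean(values)          # stable size found
        # The largest contributor to the variance is whichever of the
        # minimum or maximum elements lies farther from the mean.
        if mean - values[0] >= values[-1] - mean:
            values.pop(0)                            # drop the minimum
        else:
            values.pop()                             # drop the maximum
    return None                                      # series too unstable
```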
At each iteration, StableSize calculates the standard deviation and mean of the time series, and the width of its c-confidence interval. The element that contributes the most to the variance is the farthest from the mean, so it can be identified in constant time by checking the maximum and the minimum elements; that extreme element is then deleted. Each time an extreme element is deleted, the mean and variance are updated in constant time. The most costly step is identifying the extreme elements, which can be done efficiently using a min-max heap.
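One way to realize those constant-time updates is to maintain running sums, as in the sketch below (an illustration, not the implementation of [2]; locating the extreme element is the min-max heap's job and is not shown):

```python
class RunningStats:
    """Mean and variance maintained incrementally so that deleting an
    element costs O(1) arithmetic."""

    def __init__(self, values):
        self.n = len(values)
        self.total = sum(values)
        self.total_sq = sum(v * v for v in values)

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        return self.total_sq / self.n - self.mean ** 2

    def remove(self, x):
        """Delete one element x (e.g., the current minimum or maximum)."""
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x
```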
The algorithm fails if the time series exhibits little stability, due to abrupt size changes or due