Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management

to accurately estimate sizes. Corporate network address translation (NAT) devices

usually have the same UA on all hosts. Similarly, an Internet cafe host is used by

several users often sharing the same user ID. Meanwhile, small IPs can masquerade

as large IPs by clearing or farming cookies and overwriting UAs in HTTP requests.

Hence, estimating sizes by distinct counting of cookies and UAs may result in
overestimation or underestimation. Filtering traffic based on these inaccurate
sizes yields high false-negative and false-positive rates.
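
The failure modes described above can be illustrated with a minimal sketch of naive distinct counting. The log layout `(ip, cookie, ua)`, the function name `naive_counts`, and the sample entries are all hypothetical and serve only to show how a NAT device is undercounted by UAs while a cookie-clearing host is overcounted by cookies.

```python
from collections import defaultdict

def naive_counts(entries):
    """Return {ip: (distinct_cookies, distinct_uas)} for (ip, cookie, ua) entries."""
    cookies = defaultdict(set)
    uas = defaultdict(set)
    for ip, cookie, ua in entries:
        cookies[ip].add(cookie)
        uas[ip].add(ua)
    return {ip: (len(cookies[ip]), len(uas[ip])) for ip in cookies}

# Hypothetical entries: 10.0.0.1 is a NAT device fronting three hosts that all
# report the same UA; 10.0.0.2 is a single host that clears its cookie often.
log = [
    ("10.0.0.1", "c1", "UA-corp"),
    ("10.0.0.1", "c2", "UA-corp"),
    ("10.0.0.1", "c3", "UA-corp"),
    ("10.0.0.2", "k1", "UA-home"),
    ("10.0.0.2", "k2", "UA-home"),
    ("10.0.0.2", "k3", "UA-home"),
    ("10.0.0.2", "k4", "UA-home"),
]

counts = naive_counts(log)
```

In this toy log, counting UAs reports one user behind the three-host NAT device, and counting cookies reports four users behind the single cookie-clearing host.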

Instead, [19] advocates using the log files to build statistical models that are later

used for estimating sizes. This approach poses some challenges. First, the log files

do not contain only legitimate traffic. The existence of abusive traffic entries in these

files degrades the quality of the models and the estimated sizes. To avoid such
quality degradation, the models should be built only from the traffic of the trusted users.*

This introduces a sampling bias in the traffic used to build the models. To mitigate

this bias later in the estimation phase, only the trusted traffic of each IP† during a
period, p, is used to estimate its size for p.

Second, the sizes of the IPs change for legitimate reasons, such as reassignments,
flash crowds, and business-week cycles. For an estimation period, p, the log
files cannot be finalized before the end of p. They are then analyzed to produce
estimates after each IP has already completed its activity during p. Hence, estimated
sizes always lag behind real-time sizes. Meanwhile, real-time abuse detection
needs the estimates when p begins. This lag reduces the filtering accuracy when
an IP legitimately changes size.

Given the above challenges, [19] proposed building statistical models for size
estimation in an autonomous, passive, and privacy-preserving way from aggregated log files,
and predicting sizes using time series analysis on the estimated sizes.
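
The source does not specify here which time series technique is used for prediction. As a minimal sketch only, simple exponential smoothing is swapped in below as a stand-in: it forecasts the size for period p from the estimates of earlier periods, which is the kind of output real-time detection needs when p begins. The function name `predict_size` and the smoothing factor `alpha` are assumptions, not part of the original method.

```python
def predict_size(past_estimates, alpha=0.5):
    """Forecast the next period's size by simple exponential smoothing.

    past_estimates: estimated sizes for earlier periods, oldest first.
    alpha: smoothing factor in (0, 1]; larger values weight recent periods more.
    """
    if not past_estimates:
        raise ValueError("need at least one past estimate")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = float(past_estimates[0])
    for estimate in past_estimates[1:]:
        smoothed = alpha * estimate + (1 - alpha) * smoothed
    return smoothed

# Hypothetical history: an IP estimated at 2 users, then 4 users.
forecast = predict_size([2, 4], alpha=0.5)
```

A forecast like this still trails an abrupt legitimate change in size, so it narrows the lag described above rather than removing it.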

14.2.2 The Size Estimation Cycle

The cycle of size estimation and filtering is laid out in this section. The basic cycle

consists of four processes that communicate via log files and size lookup tables. For

period p, the inputs and outputs of the real-time traffic event logging, estimation,
prediction, and real-time abuse detection processes are formalized in Equations
14.1, 14.2, 14.3, and 14.4, respectively.

\[ \text{traffic}_p \xrightarrow{\;\text{RT-Log}_p\;} \text{log-files}_p \tag{14.1} \]

\[ \text{log-files}_p \bowtie_{\text{entry}} \text{abusive-log-files}_p \xrightarrow{\;\text{Est}_p\;} \text{estimates-table}_p \tag{14.2} \]
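
The input side of the estimation process in Equation 14.2 can be sketched as a join on entries between the log files and the abusive log files. The sketch below is one possible reading: entries that also appear in the abusive log files are dropped, so that only trusted traffic per IP remains for estimation. The record fields `entry_id` and `ip`, and the function name `trusted_traffic_by_ip`, are hypothetical. The statistical model that turns this traffic into sizes is not shown.

```python
def trusted_traffic_by_ip(log_files, abusive_log_files):
    """Group log entries by IP, skipping entries tagged as abusive.

    Both arguments are iterables of dicts with hypothetical keys
    'entry_id' and 'ip'; the join key is 'entry_id'.
    """
    abusive_ids = {entry["entry_id"] for entry in abusive_log_files}
    by_ip = {}
    for entry in log_files:
        if entry["entry_id"] in abusive_ids:
            continue  # untrusted entry: excluded from estimation
        by_ip.setdefault(entry["ip"], []).append(entry)
    return by_ip

# Hypothetical period-p logs: entry 2 was tagged by the abuse filters.
log_files_p = [
    {"entry_id": 1, "ip": "10.0.0.1"},
    {"entry_id": 2, "ip": "10.0.0.1"},
    {"entry_id": 3, "ip": "10.0.0.2"},
]
abusive_log_files_p = [{"entry_id": 2, "ip": "10.0.0.1"}]

trusted_p = trusted_traffic_by_ip(log_files_p, abusive_log_files_p)
```

An IP whose entries are all tagged as abusive does not appear in the result at all, so any later estimation step would have to decide how to handle such IPs.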

* Trusted users can be defined as those with some signature of good traffic, where the definition of good

traffic is application-dependent. For combating abusive ad clicks, trusted cookies can be defined as

those with a relatively high conversion rate, where conversions are trusted post-click activities, like

purchases from the advertisers.

† Traffic entries tagged by the abusive click detection filters are logged in abusive log files. Both trusted

and untrusted traffic entries exist in the log files. Only untrusted entries exist in the abusive log files.
