to accurately estimate sizes. Corporate network address translation (NAT) devices
usually have the same UA on all hosts. Similarly, an Internet cafe host is used by
several users often sharing the same user ID. Meanwhile, small IPs can masquerade
as large IPs by clearing or farming cookies and overwriting UAs in HTTP requests.
Hence, estimating sizes by counting distinct cookies and UAs may result in
overestimation or underestimation. Filtering traffic based on these inaccurate sizes
yields high false-negative and false-positive rates.
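
To make this failure mode concrete, here is a minimal sketch of a naive estimator that takes the size of an IP to be its number of distinct (cookie, UA) pairs. The log layout, the field order, and the name `naive_sizes` are assumptions made for illustration only, and counting pairs is just one possible reading of distinct counting.

```python
from collections import defaultdict


def naive_sizes(entries):
    """Estimate each IP's size as its number of distinct (cookie, UA) pairs."""
    seen = defaultdict(set)
    for ip, cookie, ua in entries:
        seen[ip].add((cookie, ua))
    return {ip: len(pairs) for ip, pairs in seen.items()}


# Hypothetical log entries: (ip, cookie, user_agent).
entries = [
    # Internet cafe host: three real users share one cookie and one UA.
    ("10.0.0.1", "c1", "UA-A"),
    ("10.0.0.1", "c1", "UA-A"),
    ("10.0.0.1", "c1", "UA-A"),
    # A single host that clears its cookie before every request.
    ("10.0.0.2", "x1", "UA-B"),
    ("10.0.0.2", "x2", "UA-B"),
    ("10.0.0.2", "x3", "UA-B"),
]

sizes = naive_sizes(entries)
```

In this toy input the shared host with three users is assigned size 1 (underestimation), while the single cookie-clearing host is assigned size 3 (overestimation).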
Instead, [19] advocates using the log files to build statistical models that are later
used for estimating sizes. This approach poses some challenges. First, the log files
do not contain only legitimate traffic. The existence of abusive traffic entries in these
files degrades the quality of the models and the estimated sizes. To avoid such qual-
ity degradation, the models should be built only from the traffic of the trusted users.*
This introduces a sampling bias in the traffic used to build the models. To mitigate
this bias later in the estimation phase, only the trusted traffic of each IP during a
period, p , is used to estimate its size for p .
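
The restriction to trusted traffic can be sketched as a simple filter. The sketch below assumes the conversion-rate notion of trust given in the footnote for ad clicks. The entry fields (`cookie`, `converted`), the function name `trusted_entries`, and the threshold `min_rate` are hypothetical and not taken from [19].

```python
from collections import Counter


def trusted_entries(entries, min_rate):
    """Keep only entries whose cookie has a conversion rate >= min_rate."""
    clicks = Counter()
    conversions = Counter()
    for e in entries:
        clicks[e["cookie"]] += 1
        if e["converted"]:
            conversions[e["cookie"]] += 1
    trusted = {c for c in clicks if conversions[c] / clicks[c] >= min_rate}
    return [e for e in entries if e["cookie"] in trusted]


# Hypothetical click log for one IP during one period.
entries = [
    {"ip": "10.0.0.1", "cookie": "a", "converted": True},
    {"ip": "10.0.0.1", "cookie": "a", "converted": False},
    {"ip": "10.0.0.1", "cookie": "b", "converted": False},
    {"ip": "10.0.0.1", "cookie": "b", "converted": False},
    {"ip": "10.0.0.1", "cookie": "b", "converted": False},
]

# Cookie "a" converts on 1 of 2 clicks (0.5); cookie "b" never converts.
kept = trusted_entries(entries, min_rate=0.25)
```

Applying the same filter when building the models and when estimating sizes is what keeps the two phases consistent under the sampling bias described above.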
Second, the sizes of the IPs change due to legitimate reasons, such as reassign-
ments, flash crowds and business-week cycles. For an estimation period, p , the log
files cannot be finalized before the end of p . They are then analyzed to produce
estimates after each IP has already completed its activity during p . Hence, estimated
sizes are always lagging behind real-time sizes. Meanwhile, real-time abuse detec-
tion needs the estimates when p begins. This lag reduces the filtering accuracy when
an IP legitimately changes size.
Given the above challenges, [19] proposed building statistical models for size
estimation in an autonomous, passive and privacy-preserving way from aggregated log
files, and predicting sizes using time series analysis on the estimated sizes.
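
As an illustration of how a forecast can stand in for an estimate that is not yet available when a period begins, the sketch below applies simple exponential smoothing to the estimated sizes of one IP over past periods. The text does not say which time series method [19] uses, so exponential smoothing is a stand-in here, and the name `forecast_next` and the smoothing factor `alpha` are assumptions.

```python
def forecast_next(estimates, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    `estimates` holds the estimated sizes of one IP for past periods,
    oldest first. `alpha` (hypothetical default) weights recent periods.
    """
    if not estimates:
        raise ValueError("need at least one past estimate")
    level = estimates[0]
    for x in estimates[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical estimated sizes for three finished periods.
past = [10.0, 10.0, 20.0]
prediction = forecast_next(past)
```

With these inputs the forecast moves only halfway toward the latest jump, which shows the trade-off: a smaller `alpha` damps noise but reacts more slowly when an IP legitimately changes size.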
14.2.2 The Size Estimation Cycle
The cycle of size estimation and filtering is laid out in this section. The basic cycle
consists of four processes that communicate via log files and size lookup tables. For
period p , the inputs and outputs of the real-time traffic event logging, estimation,
prediction, and real-time abuse detection processes are formalized in Equations
14.1, 14.2, 14.3, and 14.4, respectively.
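\[
\text{traffic}_p \xrightarrow{\;RT\_Log_p\;} \text{log-files}_p \tag{14.1}
\]
\[
\{\, \text{entry} \in \text{log-files}_p \mid \text{entry} \notin \text{abusive-log-files}_p \,\} \xrightarrow{\;Est_p\;} \text{estimates-table}_p \tag{14.2}
\]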
* Trusted users can be defined as those with some signature of good traffic, where the definition of good
traffic is application-dependent. For combating abusive ad clicks, trusted cookies can be defined as
those with a relatively high conversion rate, where conversions are trusted post-click activities, like
purchases from the advertisers.
Traffic entries tagged by the abusive click detection filters are logged in abusive log files. Both trusted
and untrusted traffic entries exist in the log files. Only untrusted entries exist in the abusive log files.