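\[
\text{estimates-table}_i,\ \forall\, p - w - 1 \le i \le p - 2
\ \xrightarrow{\ Prd(p)\ }\ \text{predictions-table}_p
\tag{14.3}
\]

\[
\text{traffic}_p \bowtie_{IP} \text{predictions-table}_p
\ \xrightarrow{\ RT\text{-}Abuse\text{-}Dtct(p)\ }\ \text{abusive-log-files}_p
\tag{14.4}
\]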
Real-time logging, denoted RT-Log(p) in Equation 14.1, finalizes the traffic
log-files_p as period p + 1 starts. Next, the log-files_p are consumed, among
other input, by the estimation process, Est(p), to produce the estimates-table_p,
mapping IPs that issued traffic during p to their estimated sizes (Equation 14.2).
Next, the algorithm for predicting sizes, Prd(p + 2), consumes the
estimates-tables from a sliding window of length w periods,* p − w + 1 through p,
to produce the predictions-table_{p+2}. This prediction process, Prd(p + 2), is
assumed to complete before p + 2 starts, and produces predictions-table_{p+2},
mapping IPs to their predicted sizes in period p + 2.
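The window arithmetic above can be sketched in a few lines. This is only an illustration: the function names, the table layout (a dict per period mapping IP to size), and the use of the median as the aggregate are all assumptions, since the text does not specify how Prd(.) combines the windowed estimates.

```python
from statistics import median


def window_periods(target: int, w: int) -> range:
    """Periods whose estimates-tables feed the prediction for `target`.

    Prd(p + 2) reads periods p - w + 1 through p, so for a target
    period t the window is t - w - 1 through t - 2 (inclusive).
    """
    return range(target - w - 1, target - 1)


def predict_sizes(
    estimates_tables: dict[int, dict[str, float]], target: int, w: int
) -> dict[str, float]:
    """Build a predictions-table for `target` from the windowed estimates.

    The median is a placeholder aggregate (an assumption), not the
    predictor used by the system described in the text.
    """
    per_ip: dict[str, list[float]] = {}
    for period in window_periods(target, w):
        for ip, size in estimates_tables.get(period, {}).items():
            per_ip.setdefault(ip, []).append(size)
    return {ip: median(sizes) for ip, sizes in per_ip.items()}
```

For a target period of 10 and w = 3, the window covers periods 6, 7, and 8; the estimates of period 9 are not yet available when the prediction runs, so they are ignored.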
The estimates-tables that contributed to predictions-table_p are shown in
Equation 14.3. The predictions-table_p is used by the real-time abuse detection
process, denoted RT-Abuse-Dtct(p) in Equation 14.4, to produce the
abusive-log-files_p for p. The abusive-log-files_p contain the IDs of the traffic
entries in log-files_p identified as abusive. The abusive-log-files_p are joined
with the log-files_p by Est(p) to disregard the abusive traffic entries, and
produce estimates based solely on legitimate traffic (Equation 14.2). While this
joining bases estimation exclusively on nonabusive traffic, care should be taken
to avoid overfiltering legitimate traffic.
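The join performed by Est(p) amounts to an anti-join on the entry ID. A minimal sketch follows, assuming a hypothetical log schema in which each entry is a dict with "id" and "ip" keys; the actual log format and the size estimator itself are not described in this section.

```python
from collections import defaultdict
from typing import Iterable


def legitimate_traffic_by_ip(
    log_entries: Iterable[dict], abusive_ids: Iterable[int]
) -> dict[str, list[dict]]:
    """Drop entries flagged as abusive and group the rest by source IP.

    `log_entries` stands in for log-files_p and `abusive_ids` for the
    IDs listed in abusive-log-files_p (both layouts are assumptions).
    """
    abusive = set(abusive_ids)  # set lookup keeps the anti-join linear
    by_ip: dict[str, list[dict]] = defaultdict(list)
    for entry in log_entries:
        if entry["id"] not in abusive:
            by_ip[entry["ip"]].append(entry)
    return dict(by_ip)
```

The grouped entries would then be handed to whatever size estimator is in use, so that it only sees traffic not flagged as abusive.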
This overfiltering caveat is best clarified by an example. Let IP 10.1.1.1 be
stable at an estimated size of 1 for the periods p − w through p − 1, and then
suddenly observe a flash crowd during period p. Prd(p + 1), which runs during
period p, is agnostic to this flash crowd and predicts a size of 1 for period
p + 1. Hence, RT-Abuse-Dtct(p + 1) filters the majority of the traffic from
10.1.1.1. When the log-files_{p+1} and the abusive-log-files_{p+1} are joined,
most of the traffic from this IP is not considered for estimation, and Est(p + 1)
underestimates its size. Since the estimates-table_{p+1} is fed back into
Prd(p + 3), 10.1.1.1 continues to have a small predicted size, and to be
overfiltered in p + 3. To mitigate overfiltering caused by this hysteresis loop,
estimation only disregards the egregiously abusive traffic.†
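The hysteresis loop can be reproduced with a toy simulation. Everything in this sketch is an assumption made for illustration: traffic volume is taken to be proportional to the true size (`rate` entries per unit of size), the predictor is the maximum over the estimates window, and estimation discards any traffic above h times the volume allowed for the predicted size. Setting h = 1 models an estimator that disregards all filtered traffic; h > 1 models disregarding only the egregious part.

```python
def simulate(
    true_sizes: list[float],
    h: float,
    w: int = 3,
    rate: float = 10.0,
    initial: float = 1.0,
) -> list[float]:
    """Return the estimated size per period for a single IP.

    Toy model (all assumptions): traffic = true_size * rate, the
    prediction for period t is the max estimate over t-w-1 .. t-2,
    and estimation keeps at most h * predicted * rate entries.
    """
    estimates: dict[int, float] = {}
    for t, true_size in enumerate(true_sizes):
        window = [estimates[i] for i in range(t - w - 1, t - 1) if i in estimates]
        predicted = max(window) if window else initial
        traffic = true_size * rate
        cap = h * predicted * rate  # traffic beyond this is dropped before estimation
        estimates[t] = min(traffic, cap) / rate
    return [estimates[t] for t in range(len(true_sizes))]


# Size 1 for four periods, then a legitimate flash crowd of size 10.
flash_crowd = [1, 1, 1, 1, 10, 10, 10, 10, 10, 10]
stuck = simulate(flash_crowd, h=1)      # all filtered traffic is disregarded
recovers = simulate(flash_crowd, h=3)   # only egregious traffic is disregarded
```

In this toy model, h = 1 leaves the estimate stuck at 1 for every period, while h = 3 lets it climb (1, 1, 1, 1, 3, 3, 9, 9, 10, 10) until it reaches the true size. The two-period steps reflect the lookahead: an estimate for period t first influences the prediction for t + 2.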
The estimation and prediction phases have been assumed so far to run together in
less than l = |p|, the period length. This introduced a lookahead delay of 2l.
That is,
* The length, w, of the estimates window should be long enough to span cycles in
the activities of the IPs, so that Prd(.) accounts for legitimate cyclic size
changes. Conversely, w should not be so large that the window includes very old
sizes that are unrepresentative of future sizes. In our system, the estimates
window was set to span several weekly cycles.
† Egregious traffic has been defined as the traffic that was filtered by another
fraud detection filter already deployed at Google. However, as a guideline, if
this is the only deployed filter, egregious traffic can be defined as the traffic
filtered using a threshold h times higher than the normal threshold for the size
of the source IP, where h > 1. Selecting h involves a tradeoff. As h increases,
less abusive traffic is excluded from estimation, which could later contribute to
overestimating the sizes of abusive IPs. Building attacks slowly over time
exploits this vulnerability. As h decreases, the filter becomes less vulnerable,
but produces more false positives, since the estimation cycle becomes less
responsive to unforeseen legitimate changes in sizes.