$$\{\text{estimates-table}_i\}_{i=p-w-1}^{p-2} \xrightarrow{\;\text{Prd}(p)\;} \text{predictions-table}_p \tag{14.3}$$

$$\text{traffic}_p,\ \text{predictions-table}_p \xrightarrow{\;\text{RT-Abuse-Dtct}(p)\;} \text{abusive-log-files}_p \tag{14.4}$$
Real-time logging, denoted RT-Log(p) in Equation 14.1, finalizes the traffic
log-files_p as period p + 1 starts. Next, the log-files_p are consumed, among other
inputs, by the estimation process, Est(p), to produce the estimates-table_p, which
maps the IPs that issued traffic during p to their estimated sizes (Equation 14.2).
Next, the size-prediction algorithm, Prd(p + 2), consumes the estimates-tables from a
sliding window of w periods,* p − w + 1 through p, to produce the
predictions-table_{p+2}. This prediction process is assumed to complete before
period p + 2 starts, so that the predictions-table_{p+2}, which maps IPs to their
predicted sizes for period p + 2, is available in time.
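The window arithmetic above is easy to get wrong by one period, so here is a minimal sketch of it. The function name `estimates_window` is hypothetical and not part of the described system; it only restates the indices given in the text, namely that Prd(p + 2) reads the estimates-tables of periods p − w + 1 through p.

```python
def estimates_window(target_period: int, w: int) -> list[int]:
    """Periods whose estimates-tables feed Prd(target_period).

    Prd(p + 2) reads periods p - w + 1 through p, so for a target
    period t the window is t - w - 1 through t - 2 (w periods in total).
    """
    if w < 1:
        raise ValueError("window length w must be at least 1")
    return list(range(target_period - w - 1, target_period - 1))


# Example: with w = 4, the predictions for period 12 (p = 10) are built
# from the estimates of periods 7, 8, 9 and 10, since p - w + 1 = 7.
window = estimates_window(12, 4)
```

Note that the two most recent periods before the target never appear in the window, which reflects the assumption that estimation and prediction need that time to run.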
The estimates-tables that contribute to the predictions-table_p are shown in Equation
14.3. The predictions-table_p is used by the real-time abuse detection process, denoted
RT-Abuse-Dtct(p) in Equation 14.4, to produce the abusive-log-files_p for p. The
abusive-log-files_p contain the IDs of the traffic entries in the log-files_p that
were identified as abusive. Est(p) joins the abusive-log-files_p with the log-files_p
to disregard the abusive traffic entries and produce estimates based solely on
legitimate traffic (Equation 14.2). While this join bases estimation exclusively on
nonabusive traffic, care should be taken to avoid overfiltering legitimate traffic.
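The join performed by Est(p) amounts to an anti-join on entry IDs. The sketch below illustrates only that step, under assumptions: the record shape `(entry_id, source_ip)` and the name `legitimate_entries` are hypothetical, since the text does not give the log schema, and the actual size estimation that follows the join is not shown.

```python
from typing import Iterable

# Hypothetical record shape: (entry_id, source_ip). The real schema is not given.
LogEntry = tuple[str, str]


def legitimate_entries(
    log_entries: Iterable[LogEntry], abusive_ids: Iterable[str]
) -> list[LogEntry]:
    """Drop every log entry whose ID appears in the abusive-log-files."""
    abusive = set(abusive_ids)  # set membership keeps the join linear
    return [entry for entry in log_entries if entry[0] not in abusive]


# Example: entries e2 and e3 were flagged, so only e1 and e4 reach estimation.
log_files_p = [
    ("e1", "10.1.1.1"),
    ("e2", "10.1.1.1"),
    ("e3", "10.2.2.2"),
    ("e4", "10.2.2.2"),
]
abusive_log_files_p = ["e2", "e3"]
kept = legitimate_entries(log_files_p, abusive_log_files_p)
```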
This overfiltering caveat is best clarified by an example. Let IP 10.1.1.1 be stable
at an estimated size of 1 for the periods p − w through p − 1, and then suddenly
observe a flash crowd during period p. Prd(p + 1), which runs during period p, is
agnostic to this flash crowd and predicts a size of 1 for period p + 1. Hence,
RT-Abuse-Dtct(p + 1) filters the majority of the traffic from 10.1.1.1. When the
log-files_{p+1} and the abusive-log-files_{p+1} are joined, most of the traffic from
this IP is not considered for estimation, and Est(p + 1) underestimates its size.
Since the estimates-table_{p+1} is fed back into Prd(p + 3), 10.1.1.1 continues to
have a small predicted size and to be overfiltered in p + 3. To mitigate the
overfiltering caused by this hysteresis loop, estimation disregards only the
egregiously abusive traffic.†
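The feedback loop in this example can be illustrated with a deliberately simplified toy model. Everything in it is an assumption made for illustration and not the system's actual estimator or predictor: the prediction for a period is taken to be the estimate from two periods earlier (instead of a function of the whole window), and estimation is assumed to see at most h times the predicted size, where h = 1 stands for discarding all filtered traffic and h > 1 for discarding only egregious traffic.

```python
def simulate(true_sizes: list[float], h: float, initial: float = 1.0) -> list[float]:
    """Toy hysteresis model (illustrative assumptions only).

    predicted[t] = estimates[t - 2]   (two-period lookahead)
    estimates[t] = min(true_sizes[t], h * predicted[t])
    """
    estimates: list[float] = []
    for t, true_size in enumerate(true_sizes):
        predicted = estimates[t - 2] if t >= 2 else initial
        # Traffic beyond h times the predicted size never reaches estimation.
        estimates.append(min(true_size, h * predicted))
    return estimates


# A flash crowd raises the true size from 1 to 100 at period 2.
sizes = [1, 1, 100, 100, 100, 100, 100, 100, 100, 100]

stuck = simulate(sizes, h=1)      # estimation drops all filtered traffic
recovered = simulate(sizes, h=4)  # estimation drops only egregious traffic
```

In this toy model, h = 1 leaves the estimate stuck at 1 indefinitely, whereas h = 4 lets it grow by a factor of 4 every two periods until it reaches the true size. The same growth path is what a slowly building attack could exploit, which is the tradeoff in choosing h.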
The estimation and prediction phases have been assumed so far to run together in
less than l = |p|, the period length. This introduced a lookahead delay of 2l. That is,
* The length, w, of the estimates window should be long enough to span cycles in the activities of the IPs,
so that Prd(.) considers legitimate cyclic size changes. Conversely, w should not be so large that the
window includes very old sizes that are unrepresentative of future sizes. In our system, the estimates
window was set to span several weekly cycles.
† Egregious traffic has been defined as the traffic that was filtered by another fraud detection filter
already deployed at Google. However, as a guideline, if this is the only deployed filter, egregious
traffic can be defined as the traffic filtered using a threshold h times higher than the normal threshold
for the size of the source IP, where h > 1. Selecting h involves a tradeoff. As h increases, less abusive
traffic is filtered, which could later contribute to overestimating the sizes of abusive IPs. Attacks that
build up slowly over time exploit this vulnerability. As h decreases, the filter becomes less vulnerable
but produces more false positives, since the estimation cycle becomes less responsive to unforeseen
legitimate changes in sizes.