on. In general, an entity is any dimension that aggregates ad events. For each type of
entity, a detection system based on the IP size distribution can be built. This makes it
possible to build several complementary defense mechanisms that protect against different
types of attacks.
14.5.5.1 Flagging Entities—System Overview
Figure 14.13 illustrates the workflow of the system implemented at Google. The first
step is to estimate the expected IP size distribution of each entity. Entities are
partitioned into groups, and each group might have a different IP size distribution;
however, entities within the same group are expected to share a similar distribution.
Since the majority of abusive clicks are already filtered out by existing detection
systems, the aggregate distribution of legitimate IP sizes within each group is used as
an estimate of the true IP size distribution for that group. Next, multiple statistical
methods are used to characterize the deviation between the observed and the expected
distribution. As noted in Figure 14.8, different attacks result in different deviations
in the IP size distribution. Finally, an ensemble-learning model [23] is used to combine
the outcomes of these methods into a signature vector specific to each entity. A
regression model is then built to identify and classify the signatures associated with
fraudulent entities.
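The comparison step can be illustrated with a minimal sketch. The text does not name the statistical methods used, so the three measures below (total variation distance, Kullback-Leibler divergence, and a chi-square statistic) are assumptions chosen for illustration, as are the function names and the bucketed-histogram representation of IP sizes.

```python
import math

def normalize(counts, buckets):
    """Turn raw click counts per IP-size bucket into a probability distribution."""
    total = sum(counts.get(b, 0) for b in buckets)
    if total == 0:
        raise ValueError("no clicks to normalize")
    return [counts.get(b, 0) / total for b in buckets]

def total_variation(f, r):
    """Half the L1 distance between observed f and expected r."""
    return 0.5 * sum(abs(fi - ri) for fi, ri in zip(f, r))

def kl_divergence(f, r, eps=1e-9):
    """KL(f || r); eps guards against empty expected buckets."""
    return sum(fi * math.log(fi / max(ri, eps)) for fi, ri in zip(f, r) if fi > 0)

def chi_square(f, r, n, eps=1e-9):
    """Pearson chi-square statistic for n observed clicks."""
    return sum((n * fi - n * ri) ** 2 / max(n * ri, eps) for fi, ri in zip(f, r))

def signature_vector(observed_counts, expected_counts, buckets):
    """Combine several deviation measures into one vector per entity.

    The choice of measures is illustrative, not the set used by the system.
    """
    f = normalize(observed_counts, buckets)
    r = normalize(expected_counts, buckets)
    n = sum(observed_counts.get(b, 0) for b in buckets)
    return [total_variation(f, r), kl_divergence(f, r), chi_square(f, r, n)]
```

An entity whose observed histogram matches its group's expected histogram yields a vector of zeros, while an entity whose clicks concentrate in one bucket yields larger components. A downstream model, such as the regression model mentioned above, would take such vectors as input features.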
14.5.5.2 Combining Statistical Methods
To characterize the deviation between the observed and the expected distribution
of each entity, an ensemble of different statistical methods is used. These can be
[Figure 14.13 diagram. Blocks: Click logs, IP Size DB, Geographical DB, Feature extractor, Sharder, Group of publishers, Observed and expected distribution, Statistical tests to compare distributions, Signature vectors, Predict fraud score.]
FIGURE 14.13 Flagging entities: system overview. The click logs and the information
provided by Google's IP size and geographical databases are fed as input. The feature
extractor module extracts only the features of interest, as discussed in Section 14.5.3.
Next, the sharder partitions the data into groups based on the type of entity, the type
of connecting device, and the geolocation of the source IP. For each of these groups, an
expected distribution, r, is built from the historical data of legitimate clicks. For
each entity, the observed distribution of IP sizes, f = f(P), is computed. The observed
and expected distributions are compared using several statistical methods. Finally, these
results are combined into a signature vector specific to each entity. The vector is used
to predict the entity's fraud score.
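The sharding and distribution-building steps described in the caption might look roughly like the following sketch. The `Click` record, its field names, the power-of-two bucketing of IP sizes, and the decision to compute the observed distribution from all of an entity's clicks are assumptions made for illustration; the text specifies none of them.

```python
from collections import Counter, defaultdict
from typing import NamedTuple

class Click(NamedTuple):
    # Hypothetical record layout; field names are not from the source.
    entity: str
    entity_type: str
    device: str
    country: str
    ip_size: int       # estimated number of users sharing the source IP
    legitimate: bool   # True if existing filters did not flag the click

def bucket(ip_size):
    """Assumed bucketing: floor(log2(ip_size)) for ip_size >= 1."""
    return ip_size.bit_length() - 1

def to_distribution(counter):
    """Normalize bucket counts into probabilities."""
    total = sum(counter.values())
    return {b: c / total for b, c in sorted(counter.items())}

def shard(clicks):
    """Partition clicks by (entity type, device, geolocation).

    Returns the expected distribution r per group, built from legitimate
    clicks only, and the observed distribution f per (entity, group).
    """
    expected_counts = defaultdict(Counter)
    observed_counts = defaultdict(Counter)
    for c in clicks:
        group = (c.entity_type, c.device, c.country)
        observed_counts[(c.entity, group)][bucket(c.ip_size)] += 1
        if c.legitimate:
            expected_counts[group][bucket(c.ip_size)] += 1
    expected = {g: to_distribution(cnt) for g, cnt in expected_counts.items()}
    observed = {k: to_distribution(cnt) for k, cnt in observed_counts.items()}
    return expected, observed
```

Keying the observed distribution by both entity and group means that an entity receiving traffic from several device types or countries is compared against the matching expected distribution in each case.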