Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management


on. In general, an entity is any dimension that aggregates ad events. For each type of entity, a detection system based on the IP size distribution can be built. This makes it possible to build several complementary defense mechanisms that protect against different types of attacks.

14.5.5.1 Flagging Entities—System Overview

Figure 14.13 illustrates the workflow of the system implemented at Google. The first step is to estimate the expected IP size distribution of each entity. Entities are partitioned into groups (by type of entity, type of connecting device, and geolocation of the source IP, as shown in Figure 14.13). Each group might have a different IP size distribution, but entities within the same group are expected to share a similar distribution. Since the majority of abusive clicks are already filtered out by existing detection systems, the aggregate distribution of legitimate IP sizes within each group is used as an estimate of the true IP size distribution for that group. Next, multiple statistical methods are used to characterize the deviation between the observed and the expected distribution. As noted in Figure 14.8, different attacks result in different deviations in the IP size distribution. Finally, an ensemble-learning model [23] is used to combine the outcomes of these methods into a signature vector specific to each entity. A regression model is built that identifies and classifies signatures associated with fraudulent entities.
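The comparison step can be sketched in a few lines of Python. This is only an illustrative sketch, not the system described above: the log2 bucketing of IP sizes, the helper names `distribution` and `signature`, and the three deviation measures (total variation distance, Kullback-Leibler divergence, and maximum gap between cumulative distributions) are all assumptions chosen for the example, since this section does not name the statistical methods that are actually used.

```python
from collections import Counter
import math


def distribution(ip_sizes, n_buckets=8):
    """Normalized histogram of IP sizes over log2 buckets.

    The log2 bucketing is an assumption made for this sketch.
    """
    if not ip_sizes:
        raise ValueError("need at least one IP size")
    # bit_length() - 1 equals floor(log2(size)) for positive integers.
    counts = Counter(
        min(max(int(s), 1).bit_length() - 1, n_buckets - 1) for s in ip_sizes
    )
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(n_buckets)]


def signature(observed, expected, eps=1e-9):
    """Hypothetical signature vector: one entry per deviation measure."""
    # Total variation distance between the two histograms.
    tv = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    # Kullback-Leibler divergence, smoothed by eps to avoid log(0).
    kl = sum(
        o * math.log((o + eps) / (e + eps))
        for o, e in zip(observed, expected)
        if o > 0
    )
    # Largest gap between the two cumulative distributions.
    cum_o = cum_e = max_gap = 0.0
    for o, e in zip(observed, expected):
        cum_o += o
        cum_e += e
        max_gap = max(max_gap, abs(cum_o - cum_e))
    return [tv, kl, max_gap]


# Expected distribution: pooled IP sizes of legitimate clicks in one group.
expected = distribution([1, 2, 3, 4, 8, 16, 32, 64])
# Observed distribution: one entity whose clicks all come from large IPs.
observed = distribution([500] * 10)
entity_signature = signature(observed, expected)
```

An entity whose observed distribution matches the group's expected distribution yields a signature close to the zero vector, while an entity whose clicks are concentrated in unusual IP sizes yields large entries. In the workflow described above, such vectors would then be passed to the regression model, which is not shown here.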

14.5.5.2 Combining Statistical Methods

To characterize the deviation between the observed and the expected distribution of each entity, an ensemble of different statistical methods is used. These can be

[Figure 14.13 diagram. Components: click logs, IP Size DB, Geographical DB, feature extractor, sharder, groups of publishers, observed and expected distribution, statistical tests to compare distributions, signature vectors, predict fraud score.]

FIGURE 14.13 Flagging entities—system overview: the click logs and the information provided by the Google IP size and Geographical databases are fed as input. The feature extractor module extracts only the features of interest, as discussed in Section 14.5.3. Next, the sharder partitions the data into groups based on the type of entity, the type of connecting device, and the geolocation of the source IP. For each of these groups, an expected distribution, r, is built from the historical data of legitimate clicks. For each entity, the observed distribution of IP sizes, f = f(P), is computed. The observed and expected distributions are compared using several statistical methods. Finally, these results are combined in a signature vector specific to each entity. The vector is used to predict the entity's fraud score.
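The sharding step in the caption can be illustrated with a minimal Python sketch. The `Click` record, its field names, and the helpers `shard` and `split_group` are hypothetical and introduced only for this example; the source does not describe the actual record layout or the sharder's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """Hypothetical click record; all field names are assumptions."""
    entity_id: str
    entity_type: str   # e.g. "publisher"
    device_type: str   # type of connecting device
    geo: str           # geolocation of the source IP
    ip_size: int       # estimated size of the source IP
    legitimate: bool   # True if existing filters did not flag the click


def shard(clicks):
    """Partition clicks into groups keyed by (entity type, device, geo)."""
    groups = defaultdict(list)
    for c in clicks:
        groups[(c.entity_type, c.device_type, c.geo)].append(c)
    return dict(groups)


def split_group(group_clicks):
    """Return the IP sizes behind r (group level) and f (per entity).

    r is built from legitimate clicks only; f uses every click of the entity.
    """
    expected_sizes = [c.ip_size for c in group_clicks if c.legitimate]
    observed_sizes = defaultdict(list)
    for c in group_clicks:
        observed_sizes[c.entity_id].append(c.ip_size)
    return expected_sizes, dict(observed_sizes)


clicks = [
    Click("pub-a", "publisher", "desktop", "US", 3, True),
    Click("pub-a", "publisher", "desktop", "US", 120, False),
    Click("pub-b", "publisher", "desktop", "US", 5, True),
    Click("pub-c", "publisher", "mobile", "BR", 40, True),
]
groups = shard(clicks)
expected_sizes, observed_sizes = split_group(groups[("publisher", "desktop", "US")])
```

Keying each group on the three attributes named in the caption means that an entity is only ever compared against an expected distribution drawn from comparable traffic.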
