Defining the labels
Labels are what Ian considered to be the “neglected” half of the data.
In undergrad statistics education and in data mining competitions, the
availability of labels is often taken for granted. But in reality, labels are
tough to define and capture, while at the same time they are vitally
important. A label is not just related to the objective function; it is the
objective.
In Square's setting, defining the label means being precise about:
• What counts as a suspicious activity?
• What is the right level of granularity? An event or an entity (or
both)?
• Can we capture the label reliably? What other systems do we need
to integrate with to get this data?
Lastly, Ian briefly mentioned that label noise can acutely affect
prediction problems with high class imbalance (e.g., very few positive
samples).
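To see why imbalance makes label noise so damaging, a little arithmetic helps. The sketch below is an illustration rather than anything from Ian's talk: it assumes labels are flipped at random with the same probability in both classes, and the function name and the rates are made up for the example.

```python
def positive_label_purity(base_rate: float, flip_rate: float) -> float:
    """Fraction of examples labeled positive that are truly positive.

    Assumes (hypothetically) that each label is flipped independently
    with probability `flip_rate`, regardless of its true class.
    """
    true_positives_kept = base_rate * (1.0 - flip_rate)
    negatives_flipped_to_positive = (1.0 - base_rate) * flip_rate
    return true_positives_kept / (
        true_positives_kept + negatives_flipped_to_positive
    )


# Balanced classes: 1% label noise barely dents the positive class.
balanced = positive_label_purity(base_rate=0.5, flip_rate=0.01)

# 1% positives with the same 1% noise: about half of the
# "positive" labels are actually mislabeled negatives.
imbalanced = positive_label_purity(base_rate=0.01, flip_rate=0.01)

print(f"balanced: {balanced:.2f}, imbalanced: {imbalanced:.2f}")
```

The same noise rate that is nearly harmless with balanced classes leaves the rare class roughly half wrong, which is why capturing the label reliably matters so much here.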
Challenges in features and learning
Ian says that features codify your domain knowledge. Once a machine
learning pipeline is up and running, most of the modeling energy
should be spent trying to figure out better ways to describe the domain
(i.e., coming up with new features). But you have to be aware of when
these features can actually be learned.
More precisely, when you are faced with a class-imbalanced problem,
you have to be careful about overfitting. The sample size required to
learn a feature is determined by the population of interest (in this
case, the “fraud” class), not by the total number of examples.
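A back-of-the-envelope calculation makes the point concrete. The numbers below are hypothetical, chosen only for illustration, and the helper name is an assumption of this sketch.

```python
def expected_positives_per_level(
    n_examples: int, positive_rate: float, n_levels: int
) -> float:
    """Average number of positive examples available per feature level.

    A rough guide only: it assumes positives are spread evenly across
    levels, which real data will not do.
    """
    n_positives = n_examples * positive_rate
    return n_positives / n_levels


# Hypothetical scale: one million sellers, 0.1% of them fraudulent,
# and a categorical feature with 30,000 distinct levels.
per_level = expected_positives_per_level(
    n_examples=1_000_000, positive_rate=0.001, n_levels=30_000
)
print(f"expected fraud cases per level: {per_level:.3f}")
```

Even with a million rows, the model sees far fewer than one fraud case per level on average, so whatever it "learns" about an individual level is mostly noise.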
For example, it can get tricky dealing with categorical variables with
many levels. While you may have a zip code for every seller, the zip code
alone does not carry enough information, because so few fraudulent
sellers share a zip code. In this case, you want to do some
clever binning of the zip codes. In some cases, Ian and his team create
a submodel within a model just to reduce the dimension of certain
features.
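One simple way to bin a high-cardinality variable is sketched below. It is not the method Ian's team uses, which the talk does not spell out. It assumes five-digit US zip codes stored as strings, coarsens them to their leading digits, and lumps any prefix seen too rarely into a single bucket; `bin_zip_codes`, `min_count`, `prefix_len`, and the `"OTHER"` label are all names invented for this example.

```python
from collections import Counter


def bin_zip_codes(zips, min_count=50, prefix_len=3):
    """Coarsen zip codes to a prefix and pool the rare prefixes.

    zips: list of zip codes as strings (e.g. "94103").
    min_count: prefixes seen fewer times than this become "OTHER".
    prefix_len: number of leading digits to keep.
    """
    prefixes = [z[:prefix_len] for z in zips]
    counts = Counter(prefixes)
    return [p if counts[p] >= min_count else "OTHER" for p in prefixes]


# Tiny illustrative input; a real threshold would be much larger than 2.
sample = ["94103", "94107", "94110", "10001", "73301"]
print(bin_zip_codes(sample, min_count=2))
# → ['941', '941', '941', 'OTHER', 'OTHER']
```

The threshold trades resolution for sample size: a higher `min_count` gives each surviving bin more fraud cases to learn from, at the cost of blurring geographic detail.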
There's a second data sparsity issue, which is the cold start problem
with new sellers. You don't know the same information for all of your