When you look at underutilized data, quality issues can lead to nasty
discoveries, so it pays to expect the unexpected. For example, you may often
find that the system data provided as a reference is highly variable and not as described
in the specifications. In cases like this, you either need to go back and address the core
system's data generation process or work past the quality issues. This is a fairly common
occurrence: by definition, when you are dealing with underutilized information
sources, this may be the first time they have been put to rigorous use.
This issue rises to a new level of complexity when you're combining structured data
with unstructured sources that—it almost goes without saying—are rarely managed as
official systems of record. In fact, when dealing with unstructured information (which
is the most important new source of big data), expect the data to be fuzzy, inconsistent,
and noisy. A growing range of big data sources provide non-transactional data—event,
geo-spatial, behavioral, click stream, social, sensor, and so on—that is fuzzy and noisy by
its very nature. Establishing a corporate standard and shared method for processing this
data through a single system is a very good idea.
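As a concrete illustration, a shared method can be as simple as one normalization function that every pipeline routes raw events through. The sketch below is hypothetical; the field names, timestamp formats, and defaults are assumptions, not part of any standard:

```python
# Hypothetical single shared normalizer for fuzzy, non-transactional
# event records (clickstream, sensor, social). Field names, formats,
# and defaults are illustrative assumptions.
from datetime import datetime

def normalize_event(raw: dict) -> dict:
    """Map a noisy raw event onto one corporate-standard schema.

    Unparseable fields become None rather than raising, so one bad
    record cannot halt a large batch.
    """
    def parse_ts(value):
        for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
            try:
                return datetime.strptime(str(value), fmt)
            except (ValueError, TypeError):
                continue
        return None  # tolerate noise instead of failing the batch

    return {
        "source": str(raw.get("source", "unknown")).lower(),
        "timestamp": parse_ts(raw.get("ts")),
        "user_id": str(raw["user_id"]) if raw.get("user_id") else None,
        "payload": raw.get("payload", {}),
    }
```

The design choice is tolerance: every record comes out in the same shape, and noise is preserved as explicit `None` values that downstream jobs can profile rather than exceptions that stop them.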
Interestingly, big data is ideally suited to resolving one of the data quality issues
that has long affected statistical analysis: the traditional need to build models
on training samples rather than on the entire population of data records. This idea is
important but under-appreciated. The scalability constraints of analytic data platforms
have historically forced modelers to give up granularity in the data set in order to speed
up model building, execution, and scoring. Not having the complete data population at
your disposal means that you may completely overlook outlier records and, as a result,
risk skewing your analysis toward only the records that survived the cut.
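The effect is easy to demonstrate. In the sketch below (all numbers are invented for the illustration), a population holds a handful of extreme outliers, and a typical 1% training sample can easily miss every one of them:

```python
# Illustrative only: rare outliers in a full population can vanish
# from a small training sample. Population shape and counts are
# made up for the demonstration.
import random

random.seed(0)
# 100,000 "normal" records plus 10 extreme outlier records.
population = [random.gauss(100, 15) for _ in range(100_000)] + [10_000.0] * 10
sample = random.sample(population, 1_000)  # a typical 1% training cut

outlier_rate_full = sum(x > 5_000 for x in population) / len(population)
outlier_rate_sample = sum(x > 5_000 for x in sample) / len(sample)
# The sample's outlier rate is frequently zero: the model never sees
# the tail, and downstream scoring is skewed toward the survivors.
```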
This isn't a data quality problem (the data in the source and in the sample may be
perfectly accurate and up to date) as much as a loss of data resolution downstream when
you knowingly filter out the sparse/outlier records. Let's look at a specific example in the
messy social listening space. It's easy to manage noisy or bad data when you are dealing
with general discussion about a topic. The volume of activity here usually takes care of
outliers, and you are—by definition—listening to customers. Data comes from many
sources so you can probably trust (but verify through sensitivity analysis) that missing
or bad data won't cause a misinterpretation of what people mean. However, when you
examine what a particular customer is saying and then decide how you should respond to
that individual, missing or bad data becomes much more problematic. It may or may not
be terminal in that analytics run, but it inherently presents more of a challenge. You need
to know the impact of getting it wrong and design accordingly.
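One way to know the impact of getting it wrong is a simple sensitivity analysis: re-run the aggregate with records randomly dropped and see how wide the answer swings. This is a hedged sketch; the sentiment scores, drop fraction, and trial count are all assumptions:

```python
# Hypothetical sensitivity check: recompute an aggregate sentiment
# measure with a fraction of records dropped at random, to bound how
# far missing or bad data could move the answer. All inputs invented.
import random

def mean_sentiment(scores):
    return sum(scores) / len(scores)

def sensitivity_band(scores, drop_frac=0.1, trials=200, seed=1):
    """Range of the aggregate when drop_frac of records go missing."""
    rng = random.Random(seed)
    keep = int(len(scores) * (1 - drop_frac))
    means = [mean_sentiment(rng.sample(scores, keep)) for _ in range(trials)]
    return min(means), max(means)

rng = random.Random(2)
scores = [rng.uniform(-1.0, 1.0) for _ in range(5_000)]
low, high = sensitivity_band(scores)
# If [low, high] stays on one side of your decision threshold, missing
# data is unlikely to flip the aggregate conclusion. For a decision
# about one individual customer there is no such volume to hide behind.
```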
Your data quality efforts need to be defined in terms of profiling and standards rather
than cleansing, which aligns better with how big data is managed and processed. Because
big data processing is, on the surface, batch in nature, it might seem obvious to institute
data quality rules the way they have always been done. The better answer, though, is to be
more service-oriented: invoke data quality rules that improve standardization and sourcing
during processing rather than fundamentally changing the data. In addition, data quality
rules can be invoked in a customized fashion, as service calls made from within big data processing.
This approach also makes sense because, when you do decide to persist sourced big data
into your internal infrastructure, you have already aligned the data with the existing
integration policies and business rules that govern mapping and cleansing, and that
alignment persists. In essence, you treat big data as a reference source, not a primary
source. So think about data quality in the context of supporting preprocessing with
Hadoop and MapReduce through profiling and standards, not cleansing.
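A minimal sketch of such service-oriented rules, assuming hypothetical rule and field names (this is not a real API), might look like the following. Each call works on a copy, profiling and standardizing the record while the big data source itself stays untouched:

```python
# Sketch of service-oriented quality rules: each rule standardizes or
# profiles a record copy instead of cleansing the source in place.
# Rule names and fields are illustrative assumptions, not a real API.

def rule_standardize_country(record):
    # Standardization, not correction: map known aliases to one form.
    aliases = {"usa": "US", "u.s.": "US", "united states": "US"}
    value = str(record.get("country", "")).strip().lower()
    record["country"] = aliases.get(value, value.upper() or None)
    return record

def rule_profile_completeness(record, required=("id", "country")):
    # Profiling: record what is missing rather than filling it in.
    record["_missing"] = [f for f in required if not record.get(f)]
    return record

def apply_rules(source_record, rules):
    # Work on a copy: the big data source stays a reference source.
    record = dict(source_record)
    for rule in rules:
        record = rule(record)
    return record
```

For example, `apply_rules({"id": 7, "country": "USA"}, [rule_standardize_country, rule_profile_completeness])` yields a standardized, profiled copy while the original record keeps `"USA"` untouched.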
 