When you look at underutilized data, quality issues can lead to nasty
discoveries, so it pays to expect the unexpected. For example, you may often
find that the system data provided as a reference is highly variable and not as described
in the specifications. In cases like this, you either need to go back and address the core
system's data generation process or work past the quality issues. This is a fairly common
occurrence: by definition, when you are dealing with underutilized information
sources, this may be the first time they have been put to rigorous use.
This issue rises to a new level of complexity when you're combining structured data
with unstructured sources that—it almost goes without saying—are rarely managed as
official systems of record. In fact, when dealing with unstructured information (which
is the most important new source of big data), expect the data to be fuzzy, inconsistent,
and noisy. A growing range of big data sources provide non-transactional data—event,
geo-spatial, behavioral, click stream, social, sensor, and so on—that is fuzzy and noisy by
its very nature. Establishing a corporate standard and shared method for processing this
data through a single system is a very good idea.
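As a concrete illustration, a shared method can be as simple as one normalization function that every pipeline routes raw events through. The sketch below is hypothetical; the field names, timestamp formats, and defaults are assumptions, not part of any standard:

```python
# Hypothetical single shared normalizer for fuzzy, non-transactional
# event records (clickstream, sensor, social). Field names, formats,
# and defaults are illustrative assumptions.
from datetime import datetime

def normalize_event(raw: dict) -> dict:
    """Map a noisy raw event onto one corporate-standard schema.

    Unparseable fields become None rather than raising, so one bad
    record cannot halt a large batch.
    """
    def parse_ts(value):
        for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
            try:
                return datetime.strptime(str(value), fmt)
            except (ValueError, TypeError):
                continue
        return None  # tolerate noise instead of failing the batch

    return {
        "source": str(raw.get("source", "unknown")).lower(),
        "timestamp": parse_ts(raw.get("ts")),
        "user_id": str(raw["user_id"]) if raw.get("user_id") else None,
        "payload": raw.get("payload", {}),
    }
```

The design choice is tolerance: every record comes out in the same shape, and noise is preserved as explicit `None` values that downstream jobs can profile rather than exceptions that stop them.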
Interestingly, big data is ideally suited to resolving one of the data quality issues
that has long affected statistical analysis: the traditional need to build models
on training samples rather than on the entire population of data records. This idea is
important but under-appreciated. The scalability constraints of analytic data platforms
have historically forced modelers to give up granularity in the data set in order to speed
up model building, execution, and scoring. Not having the complete data population at
your disposal means that you may completely overlook outlier records and, as a result,
risk skewing your analysis toward only the records that survived the cut.
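The effect is easy to demonstrate. In the sketch below (all numbers are invented for the illustration), a population holds a handful of extreme outliers, and a typical 1% training sample can easily miss every one of them:

```python
# Illustrative only: rare outliers in a full population can vanish
# from a small training sample. Population shape and counts are
# made up for the demonstration.
import random

random.seed(0)
# 100,000 "normal" records plus 10 extreme outlier records.
population = [random.gauss(100, 15) for _ in range(100_000)] + [10_000.0] * 10
sample = random.sample(population, 1_000)  # a typical 1% training cut

outlier_rate_full = sum(x > 5_000 for x in population) / len(population)
outlier_rate_sample = sum(x > 5_000 for x in sample) / len(sample)
# The sample's outlier rate is frequently zero: the model never sees
# the tail, and downstream scoring is skewed toward the survivors.
```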
This isn't a data quality problem (the data in the source and in the sample may be
perfectly accurate and up to date) as much as a loss of data resolution downstream when
you knowingly filter out the sparse/outlier records. Let's look at a specific example in the
messy social listening space. It's easy to manage noisy or bad data when you are dealing
with general discussion about a topic. The volume of activity here usually takes care of
outliers, and you are—by definition—listening to customers. Data comes from many
sources so you can probably trust (but verify through sensitivity analysis) that missing
or bad data won't cause a misinterpretation of what people mean. However, when you
examine what a particular customer is saying and then decide how you should respond to
that individual, missing or bad data becomes much more problematic. It may or may not
be terminal in that analytics run, but it inherently presents more of a challenge. You need
to know the impact of getting it wrong and design accordingly.
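One way to know the impact of getting it wrong is a simple sensitivity analysis: re-run the aggregate with records randomly dropped and see how wide the answer swings. This is a hedged sketch; the sentiment scores, drop fraction, and trial count are all assumptions:

```python
# Hypothetical sensitivity check: recompute an aggregate sentiment
# measure with a fraction of records dropped at random, to bound how
# far missing or bad data could move the answer. All inputs invented.
import random

def mean_sentiment(scores):
    return sum(scores) / len(scores)

def sensitivity_band(scores, drop_frac=0.1, trials=200, seed=1):
    """Range of the aggregate when drop_frac of records go missing."""
    rng = random.Random(seed)
    keep = int(len(scores) * (1 - drop_frac))
    means = [mean_sentiment(rng.sample(scores, keep)) for _ in range(trials)]
    return min(means), max(means)

rng = random.Random(2)
scores = [rng.uniform(-1.0, 1.0) for _ in range(5_000)]
low, high = sensitivity_band(scores)
# If [low, high] stays on one side of your decision threshold, missing
# data is unlikely to flip the aggregate conclusion. For a decision
# about one individual customer there is no such volume to hide behind.
```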
Your data quality efforts need to be defined in terms of profiling and standards rather
than cleansing, which aligns better with how big data is managed and processed. Because
big data processing is, on the surface, batch in nature, it might seem obvious to institute
data quality rules the way they have always been done. The better answer, though, is to be
more service-oriented: invoke data quality rules that improve standardization and sourcing
during processing rather than fundamentally changing the data. In addition, data quality
rules can be invoked in a customized fashion, as service calls made from within big data processing.
This approach also makes sense because, when you do decide to persist sourced big data
into your internal infrastructure, you have already aligned the data with the existing
integration policies and business rules that govern mapping and cleansing, and that
alignment persists. In essence, you treat big data as a reference source, not a primary
source. So think about data quality in the context of supporting preprocessing with
Hadoop and MapReduce through profiling and standards, not cleansing.
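A minimal sketch of such service-oriented rules, assuming hypothetical rule and field names (this is not a real API), might look like the following. Each call works on a copy, profiling and standardizing the record while the big data source itself stays untouched:

```python
# Sketch of service-oriented quality rules: each rule standardizes or
# profiles a record copy instead of cleansing the source in place.
# Rule names and fields are illustrative assumptions, not a real API.

def rule_standardize_country(record):
    # Standardization, not correction: map known aliases to one form.
    aliases = {"usa": "US", "u.s.": "US", "united states": "US"}
    value = str(record.get("country", "")).strip().lower()
    record["country"] = aliases.get(value, value.upper() or None)
    return record

def rule_profile_completeness(record, required=("id", "country")):
    # Profiling: record what is missing rather than filling it in.
    record["_missing"] = [f for f in required if not record.get(f)]
    return record

def apply_rules(source_record, rules):
    # Work on a copy: the big data source stays a reference source.
    record = dict(source_record)
    for rule in rules:
        record = rule(record)
    return record
```

For example, `apply_rules({"id": 7, "country": "USA"}, [rule_standardize_country, rule_profile_completeness])` yields a standardized, profiled copy while the original record keeps `"USA"` untouched.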
 