In many cases, big data involves some form of textual or unstructured data. The quality issues that plague user-entered text largely apply to big data initiatives as well. The following examples represent typical text-related data quality challenges that should be extended into big data environments (simple sketches of the first two items follow this list):

- Identifying misspelled words and managing synonym lists for grouping similar items such as “lvm,” “left voice mail,” “left a message,” etc., that may otherwise distort analysis.
- Leveraging content categorization to ensure that the textual data is relevant, for example, filtering out noise around a company name by differentiating SAS Institute, SAS shoes, SAS the airline, etc.
- Utilizing contextual intelligence to discern meaning, for example, distinguishing the person from the hotel in “Paris Hilton walks into the Paris Hilton.” This includes the ability to factor that distinction into counts and summary analysis wherever person and place must be delineated.
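As a minimal sketch of the synonym-grouping problem, the snippet below normalizes free-text notes against a hand-maintained variant list. The map, variants, and function names are illustrative assumptions, not features of any particular product:

```python
import re

# Hypothetical synonym map: canonical label -> known free-text variants.
SYNONYMS = {
    "left voicemail": ["lvm", "left voice mail", "left a message", "left msg"],
}

# Invert to variant -> canonical form for direct lookup.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}

def normalize(note: str) -> str:
    """Lowercase, trim, and collapse whitespace, then map known variants."""
    cleaned = re.sub(r"\s+", " ", note.strip().lower())
    return VARIANT_TO_CANONICAL.get(cleaned, cleaned)

notes = ["LVM", "left voice mail ", "Left a message", "call back later"]
print([normalize(n) for n in notes])
# ['left voicemail', 'left voicemail', 'left voicemail', 'call back later']
```

A fuzzy-matching step (e.g., edit distance) would extend the same idea to misspellings the variant list has not yet seen.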
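In the same spirit, a crude rule-based categorizer can separate mentions of an ambiguous company name by context cues. The cue lists below are invented for illustration; a production system would rely on a trained classifier or named-entity resolution instead:

```python
# Hypothetical context cues for each sense of the ambiguous token "SAS".
CONTEXT_CUES = {
    "software vendor": ["institute", "analytics", "statistical"],
    "airline": ["flight", "airline", "scandinavian"],
    "footwear": ["shoes", "comfort", "walking"],
}

def categorize_mention(text: str) -> str:
    """Assign the first sense whose cue words appear alongside the mention."""
    lowered = text.lower()
    for sense, cues in CONTEXT_CUES.items():
        if any(cue in lowered for cue in cues):
            return sense
    return "unknown"

print(categorize_mention("SAS Institute ships a new analytics release"))  # software vendor
print(categorize_mention("My SAS flight to Copenhagen was delayed"))      # airline
```

Person-versus-place cases like the Paris Hilton sentence generally require full named-entity recognition rather than keyword cues.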
Several other considerations apply to data quality in big data scenarios:
Consider the type of data: The data quality requirements for different forms of data vary, and your approach should match the needs of the data. For example:

- Big data projects built on traditional forms of data, such as transaction data tied to key entities like customers, products, etc., can leverage existing data quality processes, as long as those processes scale to meet the needs of massive volume.
- Big data originating from machines or sensors (e.g., RFID tags, manufacturing sensor data, telecom networks/switches, utilities, etc.) is less prone to errors than data entered by humans. Still, as additional sensor data streams in, you need to separate signal from noise: a pigeon sitting on a sensor, for example, can cause it to throw up random alarms (see the first sketch after this list).
- Social media data from Twitter, Facebook, etc., may look highly unstructured, but it still carries structure: a metadata description defining the type of tweet stream, followed by a text string that contains the content of the tweet. From a data quality perspective, this involves a combination of entity matching, monitoring to ensure that the tweet stream is not interrupted, and the ability to analyze the text itself (see the second sketch after this list).
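As a first sketch of separating signal from noise in a sensor stream, the rolling-statistics filter below flags readings that jump far outside the recent baseline, the kind of spike an obstructed sensor produces. The window size and threshold are illustrative assumptions that would need tuning per sensor:

```python
from collections import deque
from statistics import mean, stdev

def flag_anomalies(readings, window=20, threshold=4.0):
    """Yield (index, value) for readings far outside the rolling baseline.

    A blocked or malfunctioning sensor (the proverbial pigeon) tends to
    produce values far from the recent distribution, while a genuine
    trend shifts the baseline gradually and is not flagged.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value  # likely noise, not signal
        # The flagged value still enters the window, so a lone spike
        # briefly widens the baseline before rolling out again.
        history.append(value)

# A steady temperature trace with one implausible spike at index 6.
trace = [21.0, 21.1, 20.9, 21.0, 21.2, 21.1, 95.0, 21.0, 21.1]
print(list(flag_anomalies(trace, window=5, threshold=3.0)))  # [(6, 95.0)]
```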
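As a second sketch, the snippet below treats each social media record as a structured metadata envelope plus free text and checks that the stream has not gone quiet. The field names are illustrative and do not reflect any platform's actual API schema:

```python
import json
from datetime import datetime, timedelta

def parse_record(line: str):
    """Split a raw record into its metadata envelope and the free text."""
    record = json.loads(line)
    meta = {
        "id": record["id"],
        "posted_at": datetime.fromisoformat(record["posted_at"]),
        "user": record["user"],
    }
    return meta, record["text"]

def stream_interrupted(last_seen, now, max_gap=timedelta(minutes=5)):
    """Flag a possible interruption if no record arrived within max_gap."""
    return (now - last_seen) > max_gap

raw = ('{"id": "42", "posted_at": "2013-05-01T12:00:00", '
       '"user": "acme", "text": "Loving the new release!"}')
meta, text = parse_record(raw)
print(meta["user"], "->", text)
print(stream_interrupted(meta["posted_at"],
                         meta["posted_at"] + timedelta(minutes=12)))  # True
```

The entity-matching piece (tying the “acme” handle back to a customer record) would reuse the same matching processes applied to traditional data.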
Not all analysis requires exactness: If you are attempting to identify a general pattern and you have a lot of data, small inaccuracies are unlikely to sway the overall conclusion. For example, if you have a massive amount of clickstream data and you are looking for patterns (when people leave a site, which path is more likely to result in a purchase or conversion, etc.), the outliers will not change the overall conclusion. In this case, it is more an analytics process than a data quality process. However, you will still have to check the relevance aspects of the data: for example, if someone accidentally ends
 