In many cases, big data involves some form of textual or unstructured data. The quality issues that plague user-entered text largely apply to big data initiatives as well. The following examples represent typical text-related data quality challenges that should be extended into big data environments (simple sketches of the first two items follow this list):

- Identifying misspelled words and managing synonym lists for grouping similar items such as “lvm,” “left voice mail,” “left a message,” etc., that may otherwise distort analysis.
- Leveraging content categorization to ensure that the textual data is relevant, for example, filtering out noise around a company name by differentiating SAS Institute, SAS shoes, SAS the airline, etc.
- Utilizing contextual intelligence to discern meaning, for example, distinguishing the person from the hotel in “Paris Hilton walks into the Paris Hilton.” This includes the ability to factor that distinction into counts and summary analysis wherever person and place must be delineated.
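As a minimal sketch of the synonym-grouping problem, the snippet below normalizes free-text notes against a hand-maintained variant list. The map, variants, and function names are illustrative assumptions, not features of any particular product:

```python
import re

# Hypothetical synonym map: canonical label -> known free-text variants.
SYNONYMS = {
    "left voicemail": ["lvm", "left voice mail", "left a message", "left msg"],
}

# Invert to variant -> canonical form for direct lookup.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}

def normalize(note: str) -> str:
    """Lowercase, trim, and collapse whitespace, then map known variants."""
    cleaned = re.sub(r"\s+", " ", note.strip().lower())
    return VARIANT_TO_CANONICAL.get(cleaned, cleaned)

notes = ["LVM", "left voice mail ", "Left a message", "call back later"]
print([normalize(n) for n in notes])
# ['left voicemail', 'left voicemail', 'left voicemail', 'call back later']
```

A fuzzy-matching step (e.g., edit distance) would extend the same idea to misspellings the variant list has not yet seen.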
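In the same spirit, a crude rule-based categorizer can separate mentions of an ambiguous company name by context cues. The cue lists below are invented for illustration; a production system would rely on a trained classifier or named-entity resolution instead:

```python
# Hypothetical context cues for each sense of the ambiguous token "SAS".
CONTEXT_CUES = {
    "software vendor": ["institute", "analytics", "statistical"],
    "airline": ["flight", "airline", "scandinavian"],
    "footwear": ["shoes", "comfort", "walking"],
}

def categorize_mention(text: str) -> str:
    """Assign the first sense whose cue words appear alongside the mention."""
    lowered = text.lower()
    for sense, cues in CONTEXT_CUES.items():
        if any(cue in lowered for cue in cues):
            return sense
    return "unknown"

print(categorize_mention("SAS Institute ships a new analytics release"))  # software vendor
print(categorize_mention("My SAS flight to Copenhagen was delayed"))      # airline
```

Person-versus-place cases like the Paris Hilton sentence generally require full named-entity recognition rather than keyword cues.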
Several other considerations apply to data quality in big data scenarios:
Consider the type of data: The data quality requirements for different forms of data vary, and your approach should match the needs of the data. For example:

- Big data projects built on traditional forms of data, such as transaction data tied to key entities like customers, products, etc., can leverage existing data quality processes, as long as those processes scale to meet the needs of massive volume.
- Big data originating from machines or sensors (e.g., RFID tags, manufacturing sensor data, telecom networks/switches, utilities, etc.) is less prone to errors than data entered by humans. Still, as additional sensor data streams in, you need to separate signal from noise: a pigeon sitting on a sensor, for example, can cause it to throw up random alarms (see the first sketch after this list).
- Social media data from Twitter, Facebook, etc., may look highly unstructured, but it still carries structure: a metadata description defining the type of tweet stream, followed by a text string that contains the content of the tweet. From a data quality perspective, this involves a combination of entity matching, monitoring to ensure that the tweet stream is not interrupted, and the ability to analyze the text itself (see the second sketch after this list).
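As a first sketch of separating signal from noise in a sensor stream, the rolling-statistics filter below flags readings that jump far outside the recent baseline, the kind of spike an obstructed sensor produces. The window size and threshold are illustrative assumptions that would need tuning per sensor:

```python
from collections import deque
from statistics import mean, stdev

def flag_anomalies(readings, window=20, threshold=4.0):
    """Yield (index, value) for readings far outside the rolling baseline.

    A blocked or malfunctioning sensor (the proverbial pigeon) tends to
    produce values far from the recent distribution, while a genuine
    trend shifts the baseline gradually and is not flagged.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value  # likely noise, not signal
        # The flagged value still enters the window, so a lone spike
        # briefly widens the baseline before rolling out again.
        history.append(value)

# A steady temperature trace with one implausible spike at index 6.
trace = [21.0, 21.1, 20.9, 21.0, 21.2, 21.1, 95.0, 21.0, 21.1]
print(list(flag_anomalies(trace, window=5, threshold=3.0)))  # [(6, 95.0)]
```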
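As a second sketch, the snippet below treats each social media record as a structured metadata envelope plus free text and checks that the stream has not gone quiet. The field names are illustrative and do not reflect any platform's actual API schema:

```python
import json
from datetime import datetime, timedelta

def parse_record(line: str):
    """Split a raw record into its metadata envelope and the free text."""
    record = json.loads(line)
    meta = {
        "id": record["id"],
        "posted_at": datetime.fromisoformat(record["posted_at"]),
        "user": record["user"],
    }
    return meta, record["text"]

def stream_interrupted(last_seen, now, max_gap=timedelta(minutes=5)):
    """Flag a possible interruption if no record arrived within max_gap."""
    return (now - last_seen) > max_gap

raw = ('{"id": "42", "posted_at": "2013-05-01T12:00:00", '
       '"user": "acme", "text": "Loving the new release!"}')
meta, text = parse_record(raw)
print(meta["user"], "->", text)
print(stream_interrupted(meta["posted_at"],
                         meta["posted_at"] + timedelta(minutes=12)))  # True
```

The entity-matching piece (tying the “acme” handle back to a customer record) would reuse the same matching processes applied to traditional data.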
Not all analysis requires exactness: If you are attempting to identify a general pattern and you have a lot of data, small inaccuracies are unlikely to sway the overall conclusion. For example, if you have a massive amount of clickstream data and you are looking for patterns (when people leave a site, which path is more likely to result in a purchase or conversion, etc.), the outliers will not change the overall conclusion. In this case, it is more an analytics process than a data quality process. However, you will still have to check the relevance aspects of the data: for example, if someone accidentally ends
 