Running faster won't get you to the right place if you don't know where you're going. Creating better, faster, and more robust means of accessing and analyzing large data sets can still lead to erroneous outcomes if your data management and data quality processes don't keep pace.
Traditionally, whenever data quality concerns were raised, we asked the following questions:

- What are the data quality benchmarks to ensure the data will be fit for its intended use?
- What are the key data quality attributes to be measured (for example, validity, accuracy, timeliness, reasonableness, completeness)?
- What approaches will we take to manage data quality? For example, should we fix issues at the source or maintain a cleansed, quality-assured environment downstream?
- How do we capture data lineage and traceability (for example, data flows from the underlying business processes)? See the sketch after this list.
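To make the lineage question concrete, here is a minimal Python sketch that records data flows as events and traces a data set back to its upstream sources. The LineageLog class and the data set names are illustrative assumptions, not part of any standard tool.

    # A minimal sketch of capturing data lineage as data sets flow
    # through business processes. All names here are hypothetical.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class LineageEvent:
        output_dataset: str
        input_datasets: list
        operation: str
        recorded_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    class LineageLog:
        def __init__(self):
            self.events = []

        def record(self, output_dataset, input_datasets, operation):
            self.events.append(
                LineageEvent(output_dataset, input_datasets, operation))

        def trace(self, dataset):
            """Walk back through recorded events to find all upstream sources."""
            upstream = set()
            for event in self.events:
                if event.output_dataset == dataset:
                    for src in event.input_datasets:
                        upstream.add(src)
                        upstream |= self.trace(src)
            return upstream

    log = LineageLog()
    log.record("customer_master", ["crm_extract", "web_signup_feed"], "merge+dedupe")
    log.record("churn_features", ["customer_master"], "feature engineering")
    # Prints all upstream sources of churn_features:
    # customer_master, crm_extract, web_signup_feed
    print(log.trace("churn_features"))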
Will these traditional methods remain relevant for big data scenarios, or will we need new principles and processes? What are the data management and data quality implications of these technologies?
Let us discuss a few of the critical aspects of the data life cycle and the big data implications that heavily influence data quality.
Metadata. Metadata is important to any data management activity. Metadata and metadata management become even more important when dealing with large, complex, and often multi-sourced data sets. Metadata intended for use across the enterprise must be clear, easily interpreted, and applicable at a very basic level.
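As an illustration, the following minimal Python sketch shows what a baseline, enterprise-readable metadata record might look like; the field names are assumptions chosen for this example, not an established standard.

    # A minimal sketch of a baseline metadata record; the fields shown
    # are illustrative assumptions about what "minimum metadata" covers.
    from dataclasses import dataclass

    @dataclass
    class DatasetMetadata:
        name: str             # business-friendly data set name
        description: str      # plain-language meaning, interpretable enterprise-wide
        owner: str            # accountable data steward
        source_system: str    # where the data originates
        refresh_cadence: str  # e.g., "daily", "streaming"

    meta = DatasetMetadata(
        name="customer_master",
        description="Deduplicated customer records merged from CRM and web signups",
        owner="data-governance@example.com",
        source_system="CRM",
        refresh_cadence="daily",
    )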
Data Element Classification. For big data quality and management (big DQ and DM), minimum metadata requirements need to be established and, ultimately, metadata standards as well. To foster cross-enterprise use of data, taxonomies (classification or categorical structures) need to be defined, such as demographic data, financial data, geographic/geospatial data, property characteristics, and personally identifiable information.
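A minimal Python sketch of such a taxonomy follows, using the categories named above; the element-to-category mapping itself is hypothetical.

    # A minimal sketch of a data-element taxonomy. The categories follow
    # the examples in the text; the mappings are illustrative assumptions.
    from enum import Enum

    class DataCategory(Enum):
        DEMOGRAPHIC = "demographic"
        FINANCIAL = "financial"
        GEOSPATIAL = "geographic/geospatial"
        PROPERTY = "property characteristics"
        PII = "personally identifiable information"

    TAXONOMY = {
        "date_of_birth": {DataCategory.DEMOGRAPHIC, DataCategory.PII},
        "annual_income": {DataCategory.FINANCIAL},
        "home_latitude": {DataCategory.GEOSPATIAL, DataCategory.PII},
        "roof_type":     {DataCategory.PROPERTY},
    }

    def categories_for(element):
        """Return every taxonomy category a data element belongs to."""
        return TAXONOMY.get(element, set())

    print(categories_for("date_of_birth"))  # demographic and PII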
Data Acquisition. While acquiring data, it is critical that the data be organized so it is more readily accessible. Data exchange standards for big DQ and DM are key aspects of the acquisition process. Use of a common vocabulary and shared definitions facilitates the mapping of data across sources.
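The following minimal Python sketch shows how source-specific field names might be mapped onto a common vocabulary at acquisition time; the source names and canonical terms are illustrative assumptions.

    # A minimal sketch of mapping source fields to a common vocabulary.
    # Source names and canonical field names are hypothetical.
    CANONICAL_FIELDS = {
        "crm_extract":     {"cust_nm": "customer_name", "dob": "date_of_birth"},
        "web_signup_feed": {"fullName": "customer_name", "birthDate": "date_of_birth"},
    }

    def to_canonical(source, record):
        """Rename a record's fields to the shared canonical vocabulary."""
        mapping = CANONICAL_FIELDS[source]
        return {mapping.get(k, k): v for k, v in record.items()}

    # Both sources yield records keyed by the same canonical terms.
    print(to_canonical("crm_extract", {"cust_nm": "Ada Lovelace", "dob": "1815-12-10"}))
    print(to_canonical("web_signup_feed", {"fullName": "Ada Lovelace", "birthDate": "1815-12-10"}))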
Data Ingestion and Integration. Integrating data across multiple sources is certainly a large part of any big data effort. One school of thought is to create a "data lake" into which you dump data coming from various sources; later, as you start using the data, you define standards, establish lineage, and create metadata definitions. While this approach significantly reduces process-related bottlenecks, it also creates concerns around data quality. Using tools and processes such as master data management (MDM), entity resolution, and identity management will help address some of these data-quality concerns.
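As a simple illustration of entity resolution over records dumped into a data lake, consider the following Python sketch; real MDM and identity-management tooling is far more sophisticated, and the matching rule here is an illustrative assumption.

    # A minimal sketch of rule-based entity resolution: group records that
    # share a normalized match key. The normalization rule is hypothetical.
    def match_key(record):
        """Blocking key: whitespace- and case-normalized name plus date of birth."""
        name = "".join(record["customer_name"].lower().split())
        return (name, record["date_of_birth"])

    def resolve(records):
        """Group source records that share a match key into single entities."""
        entities = {}
        for rec in records:
            entities.setdefault(match_key(rec), []).append(rec)
        return entities

    records = [
        {"customer_name": "Ada Lovelace",    "date_of_birth": "1815-12-10"},
        {"customer_name": "ada  lovelace",   "date_of_birth": "1815-12-10"},
        {"customer_name": "Charles Babbage", "date_of_birth": "1791-12-26"},
    ]
    for key, group in resolve(records).items():
        print(key, "->", len(group), "source record(s)")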
While data quality has traditionally been measured in relation to its intended use, for big data projects, data quality may have to be assessed beyond its intended use, and one may have to address how data can be repurposed. To do so, data quality attributes—validity, accuracy, timeliness, reasonableness, completeness, and so forth—must be clearly defined, measured, recorded, and made available to end users.
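A minimal Python sketch of defining, measuring, and recording two of these attributes (completeness and validity) follows; the validity rule and the scorecard fields are illustrative assumptions.

    # A minimal sketch measuring two quality attributes named above.
    # The date format rule and scorecard layout are hypothetical.
    import re
    from datetime import datetime, timezone

    DOB_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    def completeness(records, field):
        """Share of records where the field is present and non-empty."""
        return sum(1 for r in records if r.get(field)) / len(records)

    def validity(records, field, pattern):
        """Share of records whose field matches the expected format."""
        return sum(1 for r in records if pattern.match(r.get(field, ""))) / len(records)

    records = [
        {"customer_name": "Ada Lovelace",    "date_of_birth": "1815-12-10"},
        {"customer_name": "Charles Babbage", "date_of_birth": "26/12/1791"},
        {"customer_name": "",                "date_of_birth": "1815-12-10"},
    ]

    # Record the scores so they can be published alongside the data set.
    scorecard = {
        "dataset": "customer_master",
        "measured_at": datetime.now(timezone.utc).isoformat(),
        "completeness.customer_name": completeness(records, "customer_name"),
        "validity.date_of_birth": validity(records, "date_of_birth", DOB_PATTERN),
    }
    print(scorecard)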
 