Artifacts relating to each data element, including business rules and value mappings,
must also be recorded. If data is mapped or cleansed, care must be taken not to lose the
original values. Data element profiles must be created. The profiles should record the
completeness of every record. Because data may migrate across systems, controls and
reconciliation criteria need to be created and recorded to ensure that data sets accurately
reflect the data at the point of acquisition and that no data was lost or duplicated in the
process.
Special care must be given to unstructured and semi-structured data, because their data
quality attributes and artifacts may not be readily defined. If structured data is created
from unstructured or semi-structured data, the creation process must also be documented
and the previously noted data quality processes applied.
In a big data scenario, you must create data quality metadata that includes data
quality attributes, measures, business rules, mappings, cleansing routines, data element
profiles, and controls.
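To make this concrete, the following is a minimal sketch, in Python, of how such data quality metadata might be recorded. The classes and field names (DataElementProfile, ReconciliationControl, and so on) are illustrative assumptions, not a prescribed schema.

```python
# A minimal, illustrative sketch of data-quality metadata for one data element,
# plus a reconciliation control recorded at acquisition and re-checked after
# each migration. Names and fields are assumptions, not a standard format.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class DataElementProfile:
    name: str                       # data element being profiled
    business_rules: list[str]       # rules the values must satisfy
    value_mappings: dict[str, str]  # cleansed value -> original value (originals preserved)
    total_records: int = 0
    populated_records: int = 0      # records in which this element is populated

    @property
    def completeness(self) -> float:
        """Share of records in which the element is populated."""
        return self.populated_records / self.total_records if self.total_records else 0.0

@dataclass
class ReconciliationControl:
    """Checks that a migrated data set still reflects the data at acquisition."""
    source_row_count: int
    target_row_count: int
    source_checksum: str
    target_checksum: str

    def is_reconciled(self) -> bool:
        # No records lost or duplicated, and content unchanged in transit.
        return (self.source_row_count == self.target_row_count
                and self.source_checksum == self.target_checksum)
```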
High Availability versus High Data Quality
Typically, big data solutions are designed to ensure high availability. High availability
rests on the premise that collecting and storing data transactions is more important
than verifying the uniqueness or accuracy of each transaction. Twitter and Facebook are
common examples of big data/high availability solutions.
It is possible to configure a big data solution to validate uniqueness and accuracy.
However, doing so requires sacrificing some aspects of high availability. So, to some
degree, big data and data quality are at odds.
This is because one of the fundamental aspects of high availability is to write
transactions to whichever node is available. In this model, consistency of transactional
data is sacrificed in the name of data capture. Most often, consistency is achieved only
eventually, and it is enforced at query time or on data reads rather than on data writes.
In other words, at any given point in time you may not have consistency in a big data
set. Even more troubling is the fact that most transactional conflicts are resolved based on
timestamps; that is, the most recently updated transaction is regarded as the most accurate.
This approach is, obviously, an issue that requires further examination.
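The sketch below illustrates this timestamp-based, "last write wins" style of conflict resolution; the record layout and values are hypothetical.

```python
# A minimal sketch of timestamp-based ("last write wins") conflict resolution.
# The replica record structure and values are hypothetical.
from datetime import datetime, timezone

def resolve_conflict(versions):
    """Given conflicting replicas of the same record, keep the newest one.

    Note the data-quality risk: the most recent write is assumed to be the
    most accurate, even if an earlier version held the correct value.
    """
    return max(versions, key=lambda v: v["updated_at"])

replicas = [
    {"node": "A", "value": "jsmith@example.com",
     "updated_at": datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"node": "B", "value": "typo@example.com",
     "updated_at": datetime(2023, 5, 1, 12, 5, tzinfo=timezone.utc)},
]

winner = resolve_conflict(replicas)
print(winner["value"])  # "typo@example.com": newer, but not necessarily more accurate
```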
Why don't we see an inherent trade-off between the volume of a data set and the
quality of the data maintained within it?
There is a common misconception that an inherent trade-off exists between the
volume of a data set and the quality of the data maintained within it. In essence, big data
sets are big, and hence it is natural to deduce that they contain a good amount of inconsistent,
inaccurate, redundant, out-of-date, or unconformed junk data. This way of thinking may
have some merit; however, let's examine the reality. When you talk about big data,
you're usually talking about more volume, more velocity, and more variety. Of course, that
means you're also likely to see more low-quality data records than in smaller data sets.
But that's simply a matter of the greater scale of big data sets, rather than a higher
incidence of quality problems. While it is true that a 1 percent data quality issue is
numerically far worse at 1 billion records as opposed to 1 million, the overall percentage
remains the same, and its impact on the resulting analytics is consistent. Under such
circumstances, dealing with the data cleanup may require more effort—but as we noted
earlier, that's exactly the sort of workload scaling where big data platforms excel.
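A quick calculation illustrates the point, using the 1 percent rate and record counts from the example above.

```python
# The same 1 percent error rate yields far more bad records in absolute terms,
# but the proportion (and hence its effect on rate-based analytics) is unchanged.
error_rate = 0.01

for total in (1_000_000, 1_000_000_000):
    bad = int(total * error_rate)
    print(f"{total:,} records -> {bad:,} bad records ({bad / total:.0%})")

# 1,000,000 records -> 10,000 bad records (1%)
# 1,000,000,000 records -> 10,000,000 bad records (1%)
```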