Artifacts relating to each data element, including business rules and value mappings,
must also be recorded. If data is mapped or cleansed, care must be taken not to lose the
original values. Data element profiles must be created. The profiles should record the
completeness of every record. Because data may migrate across systems, controls and
reconciliation criteria need to be created and recorded to ensure that data sets accurately
reflect the data at the point of acquisition and that no data was lost or duplicated in the
process.
Special care must be given to unstructured and semi-structured data, because their data
quality attributes and artifacts may not be readily defined. If structured data is created
from unstructured or semi-structured data, the creation process must also be documented
and the previously noted data quality processes applied.
In a big data scenario, you must create data quality metadata that includes data
quality attributes, measures, business rules, mappings, cleansing routines, data element
profiles, and controls.
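To make this concrete, the following is a minimal sketch, in Python, of how such data quality metadata might be recorded. The classes and field names (DataElementProfile, ReconciliationControl, and so on) are illustrative assumptions, not a prescribed schema.

```python
# A minimal, illustrative sketch of data-quality metadata for one data element,
# plus a reconciliation control recorded at acquisition and re-checked after
# each migration. Names and fields are assumptions, not a standard format.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class DataElementProfile:
    name: str                       # data element being profiled
    business_rules: list[str]       # rules the values must satisfy
    value_mappings: dict[str, str]  # cleansed value -> original value (originals preserved)
    total_records: int = 0
    populated_records: int = 0      # records in which this element is populated

    @property
    def completeness(self) -> float:
        """Share of records in which the element is populated."""
        return self.populated_records / self.total_records if self.total_records else 0.0

@dataclass
class ReconciliationControl:
    """Checks that a migrated data set still reflects the data at acquisition."""
    source_row_count: int
    target_row_count: int
    source_checksum: str
    target_checksum: str

    def is_reconciled(self) -> bool:
        # No records lost or duplicated, and content unchanged in transit.
        return (self.source_row_count == self.target_row_count
                and self.source_checksum == self.target_checksum)
```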
High Availability versus High Data Quality
Typically, big data solutions are designed to ensure high availability. High availability
rests on the premise that collecting and storing data transactions is more important
than verifying the uniqueness or accuracy of each transaction. Twitter and Facebook are
common examples of big data/high availability solutions.
It is possible to configure a big data solution to validate uniqueness and accuracy.
However, doing so requires sacrificing some aspects of high availability. So, to some
degree, big data and data quality are at odds.
This is because one of the fundamental aspects of high availability is to write
transactions to whichever node is available. In this model, consistency of transactional
data is sacrificed in the name of data capture. Most often, consistency is achieved only
eventually, and it is enforced at query time or on data reads rather than on data writes.
In other words, at any given point in time you may not have consistency in a big data
set. Even more troubling is the fact that most transactional conflicts are resolved based on
timestamps; that is, the most recently updated transaction is regarded as the most accurate.
This approach is, obviously, an issue that requires further examination.
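The sketch below illustrates this timestamp-based, "last write wins" style of conflict resolution; the record layout and values are hypothetical.

```python
# A minimal sketch of timestamp-based ("last write wins") conflict resolution.
# The replica record structure and values are hypothetical.
from datetime import datetime, timezone

def resolve_conflict(versions):
    """Given conflicting replicas of the same record, keep the newest one.

    Note the data-quality risk: the most recent write is assumed to be the
    most accurate, even if an earlier version held the correct value.
    """
    return max(versions, key=lambda v: v["updated_at"])

replicas = [
    {"node": "A", "value": "jsmith@example.com",
     "updated_at": datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"node": "B", "value": "typo@example.com",
     "updated_at": datetime(2023, 5, 1, 12, 5, tzinfo=timezone.utc)},
]

winner = resolve_conflict(replicas)
print(winner["value"])  # "typo@example.com": newer, but not necessarily more accurate
```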
Why don't we see an inherent trade-off between the volume of a data set and the
quality of the data maintained within it?
There is a common misconception that an inherent trade-off exists between the
volume of a data set and the quality of the data maintained within it. In essence, big data
sets are big, and hence it is natural to deduce that they contain a good amount of inconsistent,
inaccurate, redundant, out-of-date, or unconformed junk data. This way of thinking may
have some merit; however, let's examine the reality. When you talk about big data,
you're usually talking about more volume, more velocity, and more variety. Of course, that
means you're also likely to see more low-quality data records than in smaller data sets.
But that's simply a matter of the greater scale of big data sets, rather than a higher
incidence of quality problems. While it is true that a 1 percent data quality issue is
numerically far worse at 1 billion records as opposed to 1 million, the overall percentage
remains the same, and its impact on the resulting analytics is consistent. Under such
circumstances, dealing with the data cleanup may require more effort—but as we noted
earlier, that's exactly the sort of workload scaling where big data platforms excel.
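A quick calculation illustrates the point, using the 1 percent rate and record counts from the example above.

```python
# The same 1 percent error rate yields far more bad records in absolute terms,
# but the proportion (and hence its effect on rate-based analytics) is unchanged.
error_rate = 0.01

for total in (1_000_000, 1_000_000_000):
    bad = int(total * error_rate)
    print(f"{total:,} records -> {bad:,} bad records ({bad / total:.0%})")

# 1,000,000 records -> 10,000 bad records (1%)
# 1,000,000,000 records -> 10,000,000 bad records (1%)
```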