Process stage
Processing Big Data involves several substages, and the data transformation performed at each substage determines whether the final output is correct or incorrect.
Context processing
Context processing explores the context in which data occurs within the unstructured or Big Data environment. Establishing the relevant context helps associate the appropriate metadata and master data with the Big Data. The biggest advantage of this kind of processing is the ability to process the same data for multiple contexts, and then to look for patterns within each result set for further data mining and data exploration.
Care should be taken to process the right context for each occurrence. For example, consider the abbreviation “ha” as used by doctors. Without applying the context of where the pattern occurred, it is easy to produce noise or garbage as output. If the word occurs in the notes of a heart specialist, it means “heart attack,” whereas a neurosurgeon would have meant “headache.”
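The “ha” example above can be sketched in a few lines. This is a minimal illustration, not a real clinical NLP system: the specialty names and the expansion table are assumptions invented for the example.

```python
# Hypothetical sketch: expanding the ambiguous abbreviation "ha" using
# the context (medical specialty) in which the note was written.
# The specialties and expansion rules below are illustrative assumptions.

EXPANSIONS = {
    "cardiology": {"ha": "heart attack"},
    "neurology": {"ha": "headache"},
}

def expand_abbreviations(note: str, specialty: str) -> str:
    """Expand known abbreviations in a note using its context of occurrence."""
    rules = EXPANSIONS.get(specialty, {})
    # Replace each word that has a context-specific expansion.
    return " ".join(rules.get(word, word) for word in note.lower().split())

print(expand_abbreviations("patient reports ha symptoms", "cardiology"))
# -> patient reports heart attack symptoms
print(expand_abbreviations("patient reports ha symptoms", "neurology"))
# -> patient reports headache symptoms
```

The same note, processed under two different contexts, yields two different result sets; this is the pattern-per-context processing described above.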
You can apply several processing rules to the same data set based on the contextualization and the patterns you are looking for. The next step after contextualization is to cleanse and standardize the data with metadata, master data, and semantic libraries in preparation for integration with the data warehouse and other applications. This is discussed in the next section.
Metadata, master data, and semantic linkage
The most important step in integrating Big Data into a data warehouse is the ability to use metadata, semantic libraries, and master data as the integration links. This step is initiated once the data has been tagged and additional processing such as geocoding and contextualization has been completed. The next step is to link the data to the enterprise data set. There are many techniques for linking structured and unstructured data sets with metadata and master data. This process is the first important step in converting and integrating unstructured and raw data into a structured format.
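The linkage step can be sketched as matching the tags extracted from an unstructured document against attributes held in a master data set. This is a toy sketch under stated assumptions: the master-data table, the record keys, and the attribute being matched are all invented for illustration.

```python
# Hypothetical sketch: link a tagged unstructured document to master data
# by matching its extracted tags against a master-data attribute.
# The master table and its keys (C001, C002) are illustrative assumptions.

MASTER_CUSTOMERS = {
    "C001": {"name": "john doe", "region": "northeast"},
    "C002": {"name": "jane smith", "region": "west"},
}

def link_to_master(tags):
    """Return master-data keys whose name attribute appears among the tags."""
    tag_set = {t.lower() for t in tags}
    return [key for key, attrs in MASTER_CUSTOMERS.items()
            if attrs["name"] in tag_set]

# Tags produced by the earlier tagging and contextualization substages.
doc_tags = ["complaint", "John Doe", "shipping delay"]
print(link_to_master(doc_tags))  # -> ['C001']
```

In practice the match would run against many attributes (names, codes, geocodes) and through semantic libraries rather than a single exact-match lookup, but the structure of the step is the same.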
Linkage of different units of data from multiple data sets is not a new concept by itself. Figure 11.6
shows a common kind of linkage that is foundational in the world of relational data—referential
integrity.
Referential integrity provides the primary key and foreign key relationships in a traditional data-
base and also enforces a strong linking concept that is binary in nature, where the relationship exists
or does not exist.
Figure 11.6 shows the example of departments and employees in a company. If John Doe is an employee of the company, then there is a relationship between the employee and the department to which he belongs. If John Doe is actively employed, there is a strong relationship between the employee and the department. If he has left or retired from the company, there will be historical data for him but no current record linking the employee and department data. The model thus captures whether John Doe is an employee or not: the probability of the relationship is either 1 or 0, respectively.
When we examine the data from the unstructured world, there are many probabilistic links that
can be found within the data and its connection to the data in the structured world. This is the primary
difference between the data linkage in Big Data and the RDBMS data.
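The contrast between the two linkage styles can be sketched side by side. This is a minimal illustration under assumed data, not a definitive implementation: the foreign-key link either exists or it does not, while a link found in unstructured text carries a confidence between 0 and 1.

```python
# Hypothetical sketch: binary linkage (relational world) vs. probabilistic
# linkage (unstructured world). All names and the scoring rule are
# illustrative assumptions.

# Relational world: the employee-to-department link is a foreign key.
# The relationship exists (probability 1) or does not (probability 0).
employees = {"john doe": "sales"}
print(employees.get("john doe") == "sales")  # -> True

# Unstructured world: a candidate link comes with a confidence score.
def score_link(text: str, entity: str) -> float:
    """Toy confidence: fraction of the entity's tokens found in the text."""
    tokens = entity.lower().split()
    hits = sum(1 for t in tokens if t in text.lower())
    return hits / len(tokens)

note = "Spoke with J. Doe about the sales forecast"
print(score_link(note, "John Doe"))  # -> 0.5
```

A downstream integration process would typically keep candidate links above some threshold and route the rest for review, rather than forcing every link into the 1-or-0 model of referential integrity.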