Process stage
Processing Big Data involves several substages, and the data transformation performed at each substage determines whether the final output is correct or incorrect.
Context processing
Context processing explores the context in which data occurs within the unstructured or Big Data environment. Establishing the relevant context helps associate the appropriate metadata and master data with the Big Data. The biggest advantage of this kind of processing is the ability to process the same data for multiple contexts, and then to look for patterns within each result set for further data mining and data exploration.
Care should be taken to process the right context for each occurrence. For example, consider the abbreviation “ha” as used by doctors. Without applying the context of where the pattern occurred, it is easy to produce noise or garbage as output. If the word occurs in the notes of a heart specialist, it means “heart attack,” whereas a neurosurgeon would have meant “headache.”
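The “ha” example above can be sketched in a few lines. This is a minimal illustration, not a real clinical NLP system: the specialty names and the expansion table are assumptions invented for the example.

```python
# Hypothetical sketch: expanding the ambiguous abbreviation "ha" using
# the context (medical specialty) in which the note was written.
# The specialties and expansion rules below are illustrative assumptions.

EXPANSIONS = {
    "cardiology": {"ha": "heart attack"},
    "neurology": {"ha": "headache"},
}

def expand_abbreviations(note: str, specialty: str) -> str:
    """Expand known abbreviations in a note using its context of occurrence."""
    rules = EXPANSIONS.get(specialty, {})
    # Replace each word that has a context-specific expansion.
    return " ".join(rules.get(word, word) for word in note.lower().split())

print(expand_abbreviations("patient reports ha symptoms", "cardiology"))
# -> patient reports heart attack symptoms
print(expand_abbreviations("patient reports ha symptoms", "neurology"))
# -> patient reports headache symptoms
```

The same note, processed under two different contexts, yields two different result sets; this is the pattern-per-context processing described above.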
You can apply several processing rules to the same data set based on the contextualization and the patterns you are looking for. The next step after contextualization is to cleanse and standardize the data with metadata, master data, and semantic libraries in preparation for integration with the data warehouse and other applications. This is discussed in the next section.
Metadata, master data, and semantic linkage
The most important step in integrating Big Data into a data warehouse is the ability to use metadata, semantic libraries, and master data as the integration links. This step is initiated once the data has been tagged and additional processing such as geocoding and contextualization has been completed. The next step is to link the data to the enterprise data set. There are many techniques for linking structured and unstructured data sets with metadata and master data. This process is the first important step in converting and integrating unstructured and raw data into a structured format.
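The linkage step can be sketched as matching the tags extracted from an unstructured document against attributes held in a master data set. This is a toy sketch under stated assumptions: the master-data table, the record keys, and the attribute being matched are all invented for illustration.

```python
# Hypothetical sketch: link a tagged unstructured document to master data
# by matching its extracted tags against a master-data attribute.
# The master table and its keys (C001, C002) are illustrative assumptions.

MASTER_CUSTOMERS = {
    "C001": {"name": "john doe", "region": "northeast"},
    "C002": {"name": "jane smith", "region": "west"},
}

def link_to_master(tags):
    """Return master-data keys whose name attribute appears among the tags."""
    tag_set = {t.lower() for t in tags}
    return [key for key, attrs in MASTER_CUSTOMERS.items()
            if attrs["name"] in tag_set]

# Tags produced by the earlier tagging and contextualization substages.
doc_tags = ["complaint", "John Doe", "shipping delay"]
print(link_to_master(doc_tags))  # -> ['C001']
```

In practice the match would run against many attributes (names, codes, geocodes) and through semantic libraries rather than a single exact-match lookup, but the structure of the step is the same.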
Linkage of different units of data from multiple data sets is not a new concept by itself. Figure 11.6
shows a common kind of linkage that is foundational in the world of relational data—referential
integrity.
Referential integrity provides the primary key and foreign key relationships in a traditional data-
base and also enforces a strong linking concept that is binary in nature, where the relationship exists
or does not exist.
Figure 11.6 shows the example of departments and employees in a company. If John Doe is an employee of the company, then there is a relationship between the employee and the department to which he belongs. If John Doe is actively employed, there is a strong relationship between the employee and the department. If he has left or retired from the company, there will be historical data for him but no current record linking the employee and department data. The model thus captures whether John Doe is an employee or not: the probability of the relationship is either 1 or 0, respectively.
When we examine the data from the unstructured world, there are many probabilistic links that
can be found within the data and its connection to the data in the structured world. This is the primary
difference between the data linkage in Big Data and the RDBMS data.
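The contrast between the two linkage styles can be sketched side by side. This is a minimal illustration under assumed data, not a definitive implementation: the foreign-key link either exists or it does not, while a link found in unstructured text carries a confidence between 0 and 1.

```python
# Hypothetical sketch: binary linkage (relational world) vs. probabilistic
# linkage (unstructured world). All names and the scoring rule are
# illustrative assumptions.

# Relational world: the employee-to-department link is a foreign key.
# The relationship exists (probability 1) or does not (probability 0).
employees = {"john doe": "sales"}
print(employees.get("john doe") == "sales")  # -> True

# Unstructured world: a candidate link comes with a confidence score.
def score_link(text: str, entity: str) -> float:
    """Toy confidence: fraction of the entity's tokens found in the text."""
    tokens = entity.lower().split()
    hits = sum(1 for t in tokens if t in text.lower())
    return hits / len(tokens)

note = "Spoke with J. Doe about the sales forecast"
print(score_link(note, "John Doe"))  # -> 0.5
```

A downstream integration process would typically keep candidate links above some threshold and route the rest for review, rather than forcing every link into the 1-or-0 model of referential integrity.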