FIGURE 11.5 Processing Big Data.
Gather stage
Data is acquired from multiple sources, including real-time systems, near-real-time systems, and
batch-oriented applications. The data is collected and loaded into a storage environment such as Hadoop
or a NoSQL database. Another option is to process the data through a knowledge discovery platform and
store only the output rather than the whole data set.
Analysis stage
The analysis stage is the data discovery stage for processing Big Data and preparing it for
integration into the structured analytical platforms or the data warehouse. The analysis stage consists
of tagging, classification, and categorization of data, which closely resembles the subject area
creation and data model definition stage in the data warehouse.
Tagging: a common practice on the Internet since 2003 for data sharing. Tagging is the process of
applying a term to an unstructured piece of information to give the data a metadata-like
attribute. Tagging creates a rich nonhierarchical data set that can be used to process the data
downstream in the process stage.
Classify: unstructured data comes from multiple sources and is stored during the gather stage.
Classification groups the data into subject-oriented data sets for ease of processing. For
example, classifying all customer data in one group helps optimize the processing of unstructured
customer data.
Categorize: categorization is the external organization of data from a storage perspective, where
the data is physically grouped first by classification and then by data type. Categorization is
useful in managing the life cycle of the data, since the data is stored in a write-once model in
the storage layer.
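
The three analysis steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular platform: the Record type, the keyword vocabulary in TAG_TERMS, the rules in SUBJECT_RULES, and the sample data types ("email", "tweet") are all hypothetical names introduced only for this example. Tagging attaches terms to a piece of text, classification assigns a subject area based on those tags, and categorization groups the records by subject area and then by data type.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    """One unstructured item collected in the gather stage (hypothetical type)."""
    text: str
    data_type: str                       # e.g. "email" or "tweet" (assumed values)
    tags: set = field(default_factory=set)
    subject: str = "unclassified"


# Hypothetical vocabulary: keyword found in the text -> tag applied to the record.
TAG_TERMS = {
    "refund": "billing",
    "invoice": "billing",
    "password": "account",
    "delivery": "shipping",
}

# Hypothetical rules: tag -> subject area used for classification.
SUBJECT_RULES = {
    "billing": "customer",
    "account": "customer",
    "shipping": "logistics",
}


def tag(record):
    """Tagging: attach metadata-like terms based on keywords in the text."""
    words = re.findall(r"[a-z]+", record.text.lower())
    record.tags = {TAG_TERMS[w] for w in words if w in TAG_TERMS}
    return record


def classify(record):
    """Classification: assign a subject area from the first matching tag."""
    for t in sorted(record.tags):        # sorted for a deterministic choice
        if t in SUBJECT_RULES:
            record.subject = SUBJECT_RULES[t]
            break
    return record


def categorize(records):
    """Categorization: group by subject area first, then by data type."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.subject, r.data_type)].append(r)
    return dict(groups)


records = [
    Record("Where is my refund for the last invoice?", "email"),
    Record("Delivery was late again", "tweet"),
    Record("Cannot reset my password", "email"),
    Record("Great weather today", "tweet"),
]
groups = categorize([classify(tag(r)) for r in records])
for key, members in sorted(groups.items()):
    print(key, len(members))
```

In this sketch the grouping key (subject, data type) stands in for a physical storage location. In a write-once storage layer, each group could then be retained or expired as a unit, which is one way to read the life cycle benefit described above.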
 