[Figure labels: Raw Data Extracts; Gather Data, Load Data, Transform Data, Extract Data; Landing Zone, Ingestion, Process, Discover, Integration; Analytics DB, Operational Reporting]
FIGURE 3.6
Big Data processing flow.
Fault tolerance
Auto recovery
High degree of parallelism
Distributed data processing
Programming language interface
The one element that Big Data processing does not require is a relational database as the
backend platform for data processing.
Interestingly, the architecture and infrastructure requirements for Big Data processing are
closely aligned with web application architecture. Furthermore, several data processing
techniques for file-based architectures, including those built into operating systems, have
matured over the last 30 years. By combining these techniques, a highly scalable,
high-performance platform can be designed and deployed.
To design an efficient infrastructure and processing architecture, we need to understand the
data flow for processing Big Data. A high-level overview of Big Data processing is shown in
Figure 3.6. There are four distinct stages of processing, and the infrastructure requirements
are the same at each stage. Let us look at the processing that occurs in each stage.
Gather data. In this stage, the data is received from different sources and loaded into a file
system called the landing zone or landing area. Typically, the data is sorted into subdirectories
based on the data type. Any file modifications, such as naming or extension changes, can be
completed in this stage.
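The gather stage can be illustrated with a minimal Python sketch. It is not taken from the text: the function name gather_into_landing_zone is hypothetical, and it assumes that the file extension is an acceptable stand-in for the "data type" and that lowercasing names and replacing spaces is the kind of naming change meant here.

```python
import shutil
from pathlib import Path


def gather_into_landing_zone(source_files, landing_zone):
    """Copy incoming files into <landing_zone>/<data type>/ and return the new paths.

    Hypothetical helper for illustration only.
    """
    landing_zone = Path(landing_zone)
    placed = []
    for src in map(Path, source_files):
        # Assumption: the extension identifies the data type; files without
        # an extension go to an "unknown" subdirectory.
        data_type = src.suffix.lower().lstrip(".") or "unknown"
        target_dir = landing_zone / data_type
        target_dir.mkdir(parents=True, exist_ok=True)

        # Example of a naming/extension change made at this stage:
        # lowercase the whole name and replace spaces with underscores.
        clean_name = src.name.lower().replace(" ", "_")
        dest = target_dir / clean_name

        # Note: a later file with the same cleaned name overwrites an earlier one.
        shutil.copy2(src, dest)
        placed.append(dest)
    return placed
```

Called with a list of received files and a landing-zone path, the function leaves the originals untouched and produces one subdirectory per data type. A real landing zone would more likely sit on a distributed file system than on a local disk, which this sketch does not attempt to model.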
Load data. In this stage, the data is loaded with metadata applied (this is the stage where
a structure is applied to the data for the first time) and readied for transformation.
The loading process breaks the large input down into small chunks of files. A catalog of the
 