21.3 Additional Details on Big Data Technologies
21.3.1 Processing Approach
Current big data computing platforms use a divide-and-conquer parallel
processing approach. Multiple processors and disks are combined in large
computing clusters connected by high-speed communications switches and
networks, which allows the data to be partitioned among the available
computing resources and processed independently, so that performance and
scalability track the amount of data (Figure 5.1). We define a cluster as
“a type of parallel and distributed system, which consists of a collection of
inter-connected stand-alone computers working together as a single
integrated computing resource.”
This approach to parallel processing is often referred to as a shared-nothing
approach, since each node, consisting of processor, local memory, and disk
resources, shares nothing with other nodes in the cluster. In parallel
computing, this approach is considered suitable for data processing problems
that are embarrassingly parallel, that is, where it is relatively easy to
separate the problem into a number of parallel tasks and no dependency or
communication is required between the tasks other than overall management
of the tasks. These types of data processing problems are inherently
adaptable to various forms of distributed computing, including clusters,
data grids, and cloud computing.
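
The partition-and-process pattern described above can be illustrated with a
minimal sketch. It is not taken from any particular platform: the function
names (partition, process_partition, run_job) and the word-count task are
hypothetical, and threads on one machine stand in for cluster nodes. The point
is only that each task sees its own partition and nothing else, and that the
sole coordination is the final merge of partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split records into num_nodes disjoint partitions (round-robin)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def process_partition(part):
    """Work done by one node: count words using only its own partition."""
    counts = Counter()
    for line in part:
        counts.update(line.lower().split())
    return counts


def run_job(records, num_nodes=4):
    """Partition the data, process each partition independently, then merge."""
    parts = partition(records, num_nodes)
    # Assumption for illustration: threads stand in for cluster nodes.
    # No task reads or writes another task's partition.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_partition, parts))
    # The only coordination step: the manager combines the partial results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    logs = ["big data", "data grids", "big clusters", "data"]
    print(run_job(logs, num_nodes=2))
```

Because the tasks are independent, the merged result does not depend on how
many partitions are used; adding nodes changes only how the work is divided.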
21.3.2 Big Data System Architecture
A variety of system architectures have been implemented for big data
and large-scale data analysis applications, including parallel and
distributed relational database management systems that have been available
to run on shared-nothing clusters of processing nodes for more than two
decades. These include database systems from Teradata, Netezza, Vertica,
Exadata/Oracle, and others, which provide high-performance parallel
database platforms. Although these systems can run parallel applications
and queries expressed in SQL, they are typically not general-purpose
processing platforms and usually run as a back-end to a separate front-end
application processing system.
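
As a rough illustration of how a SQL query can run in parallel over a
shared-nothing cluster, the sketch below hash-partitions rows across several
independent databases, runs the same aggregate query on each, and merges the
partial results in a coordinator. It does not describe how any of the products
named above work internally: in-memory SQLite databases stand in for nodes,
and the sales table and the helper names make_nodes and parallel_query are
hypothetical.

```python
import sqlite3
import zlib


def make_nodes(rows, num_nodes):
    """Create one in-memory database per node and hash-partition rows by region."""
    nodes = [sqlite3.connect(":memory:") for _ in range(num_nodes)]
    for db in nodes:
        db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    for region, amount in rows:
        # A stable hash of the key decides which node owns the row.
        owner = nodes[zlib.crc32(region.encode("utf-8")) % num_nodes]
        owner.execute("INSERT INTO sales VALUES (?, ?)", (region, amount))
    return nodes


def parallel_query(nodes):
    """Run the same SQL on every node, then merge the partial aggregates."""
    sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    totals = {}
    for db in nodes:  # each node scans only its own local data
        for region, subtotal in db.execute(sql):
            totals[region] = totals.get(region, 0) + subtotal
    return totals


if __name__ == "__main__":
    rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]
    print(parallel_query(make_nodes(rows, num_nodes=3)))
```

The front-end application issues one query and receives one merged answer;
the partitioning stays hidden behind the database layer, consistent with the
back-end role described above.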
Although this approach offers benefits when the data are primarily
structured and fit easily into the constraints of a relational database,
and often excels for transaction processing applications, most data growth
is in unstructured data, and new processing paradigms with more flexible
data models were needed. Internet companies such as Google, Yahoo,
Microsoft, Facebook, and others required a new processing approach to deal
effectively with the enormous amount of Web data for applications such as
search engines and social networking. In addition,