Introducing Big Data Technologies - Data Warehousing in the Age of Big Data


FIGURE 4.1 Distributed data processing in the RDBMS. [Diagram omitted; node labels: Primary, DB, Geo 1, Geo 2, Geo 3, Geo 4.]

include MongoDB, Neo4j, Riak, Amazon DynamoDB, MemcachedDB, BerkeleyDB, Voldemort, and many more. Though many of these platforms were originally developed and deployed to meet the data processing needs of web applications and search engines, they have evolved to support other data processing requirements. The rest of this chapter explains how these platforms manage data processing. It is not a step-by-step tutorial on configuring and using these technologies. References for further reading are provided at the end.

Distributed data processing

Before we proceed to understand how Big Data technologies work and see associated reference architectures, let us recap distributed data processing.

Distributed data processing has been in existence since the late 1970s. The primary concept was to replicate the DBMS in a master-slave configuration and process data across multiple instances (Figure 4.1). When processing a query, each slave would engage in a two-phase commit with its master. Several papers on the subject, and on how its early implementations were designed, have been written by Dr. Stonebraker 1, Teradata, departments of the University of California at Berkeley, and others.
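The two-phase commit mentioned above can be illustrated with a minimal sketch. It is not taken from any particular DBMS: the `Replica` class, the `two_phase_commit` function, and the key names are all hypothetical, and the sketch leaves out the logging, timeouts, and recovery that a real implementation needs. It only shows the core idea: the master asks every replica to prepare, and the write is committed only if all of them vote yes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica holding a key-value copy of the data."""
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, key, value):
        # Phase 1: stage the write and vote; an unhealthy replica votes "no".
        if not self.healthy:
            return False
        self.staged[key] = value
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged writes visible.
        self.data.update(self.staged)
        self.staged.clear()

    def abort(self):
        # Phase 2 (any vote was no): discard the staged writes.
        self.staged.clear()


def two_phase_commit(replicas, key, value):
    """Return True if every replica committed, False if the write was aborted."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False


geo1, geo2 = Replica("geo1"), Replica("geo2")
ok = two_phase_commit([geo1, geo2], "order:1", "shipped")

geo2.healthy = False  # simulate one replica becoming unavailable
failed = two_phase_commit([geo1, geo2], "order:2", "pending")

print(ok, failed, geo1.data)  # True False {'order:1': 'shipped'}
```

In this sketch a single unavailable replica aborts the write for every instance, and each write waits for a vote from every replica before it completes.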

Several commercial and early open-source DBMS systems have addressed large-scale data processing with distributed data management algorithms; however, they all faced problems in the areas of concurrency, fault tolerance, support for multiple redundant copies of data, and distributed processing of programs. A bigger barrier was the cost of infrastructure.

Why did distributed data processing fail to meet the requirements in the relational data processing architecture? Its success was hit or miss, depending on the complexity of the architecture. The answer to this question lies in multiple dimensions:

● Dependency on RDBMS:
    ● ACID (atomicity, consistency, isolation, and durability) compliance for transaction management
    ● Complex architectures for consistency management
    ● Latencies across the system

1 DeWitt, D. J., & Stonebraker, M. (2008). MapReduce: a major step backwards. The Database Column (http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html).
