Table 1-1. RDBMS compared to MapReduce

              Traditional RDBMS            MapReduce
Data size     Gigabytes                    Petabytes
Access        Interactive and batch        Batch
Updates       Read and write many times    Write once, read many times
Transactions  ACID                         None
Structure     Schema-on-write              Schema-on-read
Integrity     High                         Low
Scaling       Nonlinear                    Linear

However, the differences between relational databases and Hadoop systems are blurring.
Relational databases have started incorporating some of the ideas from Hadoop, and from
the other direction, Hadoop systems such as Hive are becoming more interactive (by
moving away from MapReduce) and adding features like indexes and transactions that make
them look more and more like traditional RDBMSs.
Another difference between Hadoop and an RDBMS is the amount of structure in the
datasets on which they operate. Structured data is organized into entities that have a
defined format, such as XML documents or database tables that conform to a particular
predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other
hand, is looser, and though there may be a schema, it is often ignored, so it may be used
only as a guide to the structure of the data: for example, a spreadsheet, in which the
structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain text
or image data. Hadoop works well on unstructured or semi-structured data because it is
designed to interpret the data at processing time (so-called schema-on-read). This provides
flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is
just a file copy.
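
As a rough illustration of schema-on-read, the sketch below keeps raw records exactly as they arrived and applies a schema only when the data is read. It is plain Python, not Hadoop code; the sample data, the read_with_schema helper, and both schemas are hypothetical names invented for this example.

```python
import io

# Hypothetical raw data, stored exactly as it arrived (loading is just a copy).
RAW = (
    "2014-01-01,alice,42\n"
    "2014-01-02,bob,not-a-number\n"
    "2014-01-03,carol,7\n"
)

def read_with_schema(stream, schema):
    """Interpret each line at read time; skip records that do not fit the schema."""
    for line in stream:
        fields = line.rstrip("\n").split(",")
        if len(fields) != len(schema):
            continue  # wrong number of fields for this schema
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # malformed value: skipped when read, not rejected when loaded

# Two different interpretations of the same stored bytes.
typed_schema = [("date", str), ("user", str), ("count", int)]
loose_schema = [("date", str), ("user", str), ("count", str)]

typed_rows = list(read_with_schema(io.StringIO(RAW), typed_schema))
loose_rows = list(read_with_schema(io.StringIO(RAW), loose_schema))
```

The same stored data yields different record sets depending on the schema chosen at read time, whereas a schema-on-write system would have had to accept or reject the malformed line when it was loaded.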
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for Hadoop processing because it makes reading a record a
nonlocal operation, and one of the central assumptions that Hadoop makes is that it is
possible to perform (high-speed) streaming reads and writes.
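
The contrast can be sketched in a few lines of Python. This is only an illustration under assumed data: the hostnames, the CLIENTS table, and both helper functions are hypothetical, and real Hadoop jobs would be written against its own APIs. In the denormalized form each record carries everything needed to process it, so one sequential pass suffices; in the normalized form every record requires a lookup in a separate table.

```python
from collections import Counter

# Denormalized records: each line is self-contained.
DENORMALIZED_LOG = [
    "host-a.example.com GET /index.html",
    "host-b.example.com GET /about.html",
    "host-a.example.com GET /contact.html",
]

# Normalized records: each one refers to a client by ID, and the
# client details are stored once in a separate table.
CLIENTS = {1: "host-a.example.com", 2: "host-b.example.com"}
NORMALIZED_LOG = [
    (1, "GET", "/index.html"),
    (2, "GET", "/about.html"),
    (1, "GET", "/contact.html"),
]

def hits_streaming(lines):
    """Count requests per host in a single sequential pass over the records."""
    counts = Counter()
    for line in lines:
        host = line.split(" ", 1)[0]
        counts[host] += 1
    return counts

def hits_with_lookup(records, clients):
    """Count requests per host; each record needs a read from another table."""
    counts = Counter()
    for client_id, _method, _path in records:
        counts[clients[client_id]] += 1  # the nonlocal part of reading a record
    return counts
```

Both functions produce the same counts here. The difference is that the second depends on the client table being reachable for every record, which is cheap in memory but becomes a join when the table is large and stored elsewhere.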
A web server log is a good example of a set of records that is not normalized (for
example, the client hostnames are specified in full each time, even though the same client
may appear many times), and this is one reason that logfiles of all kinds are particularly