Meet Hadoop - Hadoop: The Definitive Guide


Table 1-1. RDBMS compared to MapReduce

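|              | Traditional RDBMS         | MapReduce                   |
|--------------|---------------------------|-----------------------------|
| Data size    | Gigabytes                 | Petabytes                   |
| Access       | Interactive and batch     | Batch                       |
| Updates      | Read and write many times | Write once, read many times |
| Transactions | ACID                      | None                        |
| Structure    | Schema-on-write           | Schema-on-read              |
| Integrity    | High                      | Low                         |
| Scaling      | Nonlinear                 | Linear                      |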

However, the differences between relational databases and Hadoop systems are blurring. Relational databases have started incorporating some of the ideas from Hadoop, and from the other direction, Hadoop systems such as Hive are becoming more interactive (by moving away from MapReduce) and adding features like indexes and transactions that make them look more and more like traditional RDBMSs.

Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time (so-called schema-on-read). This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy.
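As a rough illustration of schema-on-read, the sketch below keeps raw text lines exactly as they arrived and applies a schema only when the data is processed. It is plain Python, not Hadoop code; the three-field record layout (host, status, bytes), the sample data, and the names `parse_record` and `total_bytes_by_host` are assumptions made up for this example.

```python
import io

# Raw records are stored as-is; nothing validates them at load time.
# The three-field layout (host, status, bytes) is a hypothetical schema.
RAW_LOG = """\
host1.example.com 200 512
host2.example.com 404 0
malformed line
host1.example.com 200 2048
"""


def parse_record(line):
    """Apply the schema at read time; return None if the line does not fit."""
    parts = line.split()
    if len(parts) != 3:
        return None
    host, status, size = parts
    try:
        return {"host": host, "status": int(status), "bytes": int(size)}
    except ValueError:
        return None


def total_bytes_by_host(lines):
    """Sum the bytes field per host, skipping lines the schema rejects."""
    totals = {}
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue  # tolerate records that do not match the schema
        totals[record["host"]] = totals.get(record["host"], 0) + record["bytes"]
    return totals


totals = total_bytes_by_host(io.StringIO(RAW_LOG))
print(totals)
```

Because the interpretation lives in the processing code rather than in a load step, a different job could read the same raw lines with a different schema without reloading anything.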

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.
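The difference can be sketched in plain Python (not Hadoop code). In the normalized layout, each request holds only a client ID, so reading a full record needs a lookup into a second dataset; in the denormalized layout, every record is self-contained and can be handled in one sequential pass. The sample data and function names are hypothetical, and the in-memory dictionary merely stands in for a separate table that, on a real cluster, may live somewhere else entirely.

```python
# Normalized layout: requests refer to clients by ID, and hostnames
# live in a separate (hypothetical) lookup table.
clients = {1: "host1.example.com", 2: "host2.example.com"}
requests_normalized = [(1, "/index.html"), (2, "/about.html"), (1, "/index.html")]

# Denormalized layout: the hostname is repeated in full in every record.
requests_denormalized = [
    ("host1.example.com", "/index.html"),
    ("host2.example.com", "/about.html"),
    ("host1.example.com", "/index.html"),
]


def count_hits_normalized(requests, client_table):
    """Count hits per host; each record needs a lookup in another dataset."""
    counts = {}
    for client_id, _path in requests:
        host = client_table[client_id]  # the nonlocal step
        counts[host] = counts.get(host, 0) + 1
    return counts


def count_hits_denormalized(requests):
    """Count hits per host in a single streaming pass over the records."""
    counts = {}
    for host, _path in requests:
        counts[host] = counts.get(host, 0) + 1
    return counts


hits_normalized = count_hits_normalized(requests_normalized, clients)
hits_denormalized = count_hits_denormalized(requests_denormalized)
print(hits_normalized == hits_denormalized)
```

Both functions produce the same counts; the point is only that the second never has to leave the record it is currently reading.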

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly
