Big Data, Data Warehouses, and Business Intelligence Systems - Database Processing: Fundamentals, Design, and Implementation - page 570

Figure 12-34: MapReduce

INPUT: A document divided into Document Sections 01, 02, 03, ..., N.

MAP: Computers 01, 02, 03, ..., N each take one document section, list its individual words, and count how many times each word appears.

REDUCE: One computer combines the lists of individual words and totals the counts of how many times each word appears.

OUTPUT: The word count, for example A 56, And 85, Boy 15, Dog 27, ..., The 67, Shown 12, Sun 12, Way 7.

A commonly used example of the MapReduce process is counting how many times each word is used in a document. This is illustrated in Figure 12-34, where we can see how the original document is broken into sections, and then each section is passed to a separate computer in the cluster for processing by the Map process. The output from each of the Map processes is then passed to one computer, which uses the Reduce process to combine the results from each Map process into the final output, which is the list of words and how many times each appears in the document.
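The word-count process described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Hadoop code: worker threads stand in for the separate computers in the cluster, and the function names (split_into_sections, map_section, reduce_counts, word_count) are hypothetical names chosen for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import re


def split_into_sections(document, n_sections):
    """Break the document into roughly equal sections of whole words."""
    words = document.split()
    size = max(1, -(-len(words) // n_sections))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def map_section(section):
    """Map step: list the individual words in one section and count each."""
    return Counter(re.findall(r"[a-z']+", section.lower()))


def reduce_counts(partial_counts):
    """Reduce step: combine the per-section lists and total the counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(document, n_sections=3):
    """Run Map on each section in parallel, then Reduce the results."""
    sections = split_into_sections(document, n_sections)
    # Assumption: threads play the role of the cluster's computers.
    with ThreadPoolExecutor(max_workers=n_sections) as pool:
        partial = list(pool.map(map_section, sections))
    return reduce_counts(partial)


if __name__ == "__main__":
    text = "the dog and the boy saw the sun and a dog"
    for word, count in sorted(word_count(text).items()):
        print(word, count)
```

Because each Map call sees only its own section and the Reduce step only adds counts together, the final totals do not depend on how many sections the document is split into.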

Hadoop

Another Apache Software Foundation project that is becoming a fundamental Big Data development platform is Hadoop, which includes the Hadoop Distributed File System (HDFS). HDFS provides standard file services to clustered servers so their file systems can function as one distributed file system (see http://hadoop.apache.org). Hadoop originated as part of the Apache Nutch web search project, and the Hadoop project has since spun off a nonrelational data store of its own called HBase (see http://hbase.apache.org) and a high-level data-flow language platform named Pig (see http://pig.apache.org).

Further, all the major DBMS players are supporting Hadoop. Microsoft is planning a Microsoft Hadoop distribution (see http://social.technet.microsoft.com/wiki/contents/articles/microsoft-hadoop-distribution-documentation-plan.aspx) and has teamed up with HP and Dell to offer the SQL Server Parallel Data Warehouse (see http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx). Oracle has developed the Oracle Big Data Appliance that uses Hadoop (see www.oracle.com/us/corporate/press/512001). A search of the Web on the term “MySQL Hadoop” quickly reveals that a lot is being done by the MySQL team as well.

The usefulness and importance of these Big Data products to organizations such as Facebook demonstrate that we can look forward to the development of not only improvements to the relational DBMSs but also a very different approach to data storage and information processing. Big Data and products associated with Big Data are rapidly changing and evolving, and you should expect many developments in this area in the near future.
