[Figure: INPUT: DOCUMENT, divided into Document Section 01 through Document Section N. MAP: Computer 01 through Computer N each list the individual words in one section and count how many times each word appears. REDUCE: one computer combines the lists of individual words and totals the counts of how many times each word appears. OUTPUT: WORD COUNT, for example A 56, And 85, Boy 15, Dog 27, ..., The 67, Shown 12, Sun 12, Way 7.]
Figure 12-34
MapReduce
A commonly used example of the MapReduce process is counting how many times each
word is used in a document. This is illustrated in Figure 12-34. The original document is
broken into sections, and each section is passed to a separate computer in the cluster for
processing by the Map process, which lists the words in that section and counts how many
times each appears. The output from each Map process is then passed to one computer,
which uses the Reduce process to combine the per-section results into the final output:
the list of words and how many times each appears in the whole document.
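The word count in Figure 12-34 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Hadoop code: the function names (split_into_sections, map_word_count, reduce_word_counts, word_count) and the sample sentence are assumptions made up for this example, and the "computers" in the cluster are simulated by ordinary function calls.

```python
import re
from collections import Counter
from functools import reduce


def split_into_sections(document, n_sections):
    """Break the document into at most n_sections pieces at word boundaries."""
    words = document.split()
    size = max(1, -(-len(words) // n_sections))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def map_word_count(section):
    """Map step: list the words in one section and count each one."""
    return Counter(re.findall(r"[a-z']+", section.lower()))


def reduce_word_counts(partial_counts):
    """Reduce step: combine the per-section lists and total the counts."""
    return reduce(lambda total, part: total + part, partial_counts, Counter())


def word_count(document, n_sections=3):
    sections = split_into_sections(document, n_sections)
    # In a real cluster each call below would run on a separate computer;
    # here they simply run one after another (an assumption of this sketch).
    partial_counts = [map_word_count(section) for section in sections]
    return reduce_word_counts(partial_counts)


if __name__ == "__main__":
    text = "the dog and the boy saw the sun and the dog ran"
    for word, count in sorted(word_count(text).items()):
        print(word, count)
```

Because each Map call looks only at its own section, the sections can be processed in any order or all at once; only the Reduce step needs to see every partial result.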
Hadoop
Another Apache Software Foundation project that is becoming a fundamental Big Data
development platform is Hadoop, whose Hadoop Distributed File System (HDFS) provides
standard file services to clustered servers so that their file systems can function as one
distributed file system (see http://hadoop.apache.org). Hadoop originated as part of the
Apache Nutch project, and the Hadoop project has in turn spun off a nonrelational data store
of its own called HBase (see http://hbase.apache.org) and a query language named Pig
(see http://pig.apache.org).
Further, all the major DBMS vendors are supporting Hadoop. Microsoft is planning a
Microsoft Hadoop distribution (see
http://social.technet.microsoft.com/wiki/contents/articles/microsoft-hadoop-distribution-documentation-plan.aspx)
and has teamed up with HP and Dell to offer the SQL Server Parallel Data Warehouse (see
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx).
Oracle has developed the Oracle Big Data Appliance, which uses Hadoop (see
www.oracle.com/us/corporate/press/512001). A Web search on the term "MySQL Hadoop"
quickly shows that the MySQL team is actively working with Hadoop as well.
The usefulness and importance of these Big Data products to organizations such as
Facebook show that we can expect not only improvements to relational DBMSs but also
a very different approach to data storage and information processing. Big Data and the
products associated with it are changing rapidly, and you should expect many developments
in this area in the near future.
 
 