{and, 2}
{creeps, 1}
{day, 2}
{from, 1}
{in, 1}
{last, 1}
{of, 1}
{pace, 1}
{petty, 1}
{recorded, 1}
{syllable, 1}
{the, 1}
{this, 1}
{time, 1}
{to, 2}
{tomorrow, 3}
At first, it might seem a bit awkward to decompose your computation into
Map and Reduce phases. However, a very wide range of problems can be
broken into a series of Map and Reduce operations, and there are a number
of tools built on top of Hadoop that can help.
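As a concrete illustration, the word counts listed above can be produced by a minimal word-count job. This is a hypothetical single-process sketch, not Hadoop API code: the map phase emits a (word, 1) pair per word, the shuffle groups pairs by key, and the reduce phase sums the counts for each word.

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.lower().replace(",", "").split()]

def reduce_phase(pairs):
    """Reduce (after an implicit shuffle): sum the counts per distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The Macbeth passage that yields the counts shown above.
lines = ["Tomorrow, and tomorrow, and tomorrow, creeps in this petty pace "
         "from day to day, to the last syllable of recorded time"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
result = reduce_phase(mapped)
print(result["tomorrow"])  # 3
print(result["and"])       # 2
```

In a real MapReduce run the map calls execute on many machines, and the framework performs the shuffle between the two phases, but the per-record logic is exactly this simple.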
Storage System
MapReduce doesn't rely on a particular storage system or storage format the
way relational databases do. However, in practice, a standard filesystem is
not well suited to MapReduce workloads. If every worker reads from a
single disk, throughput is capped by that one disk, so processing the
data in parallel doesn't actually help.
Apache Hadoop uses a custom distributed filesystem called HDFS that is
in many ways similar to the Google File System (GFS) or the Colossus File
System used by BigQuery. Because HDFS is distributed among many nodes,
it can read in parallel and hopefully keep up with the Map and Reduce
workers.
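The benefit of a distributed filesystem can be sketched with a toy model (this is an illustration, not HDFS code): the input is pre-divided into splits stored on different nodes, each map worker reads and counts only its own split in parallel, and a merge step combines the partial results. The split boundaries and worker pool here are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend each string is a file split living on a different storage node.
splits = [
    "tomorrow and tomorrow and tomorrow",
    "creeps in this petty pace from day to day",
    "to the last syllable of recorded time",
]

def map_worker(split):
    """Each worker counts words in its local split only."""
    counts = {}
    for word in split.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Workers run concurrently, each reading its own split.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    partials = list(pool.map(map_worker, splits))

# The reduce step merges the per-split partial counts.
totals = {}
for partial in partials:
    for word, n in partial.items():
        totals[word] = totals.get(word, 0) + n
```

Because no two workers contend for the same disk, aggregate read bandwidth scales with the number of splits, which is the property HDFS and GFS are designed to provide.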
Worker Management
The final key piece of the MapReduce architecture is the worker manager,
called the master in Google's MapReduce or the JobTracker in Hadoop
(Hadoop's NameNode, by contrast, manages HDFS metadata). When
dealing with large numbers of workers (many MapReduces use hundreds or