Querying All Your Data
MapReduce's approach may seem like brute force. The premise is
that the entire dataset, or at least a good portion of it, can be processed for each query.
But this is its power. MapReduce is a batch query processor, and the ability to run an ad
hoc query against your whole dataset and get the results in a reasonable time is
transformative. It changes the way you think about data and unlocks data that was previously
archived on tape or disk. It gives people the opportunity to innovate with data. Questions
that took too long to get answered before can now be answered, which in turn leads to new
questions and new insights.
For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs.
One ad hoc query they wrote was to find the geographic distribution of their users. In their
words:
"This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using
this data to help us decide which Rackspace data centers to place new mail servers in as we grow."
By bringing several hundred gigabytes of data together and having the tools to analyze it,
the Rackspace engineers were able to gain an understanding of the data that they otherwise
would never have had, and furthermore, they were able to use what they had learned to
improve the service for their customers.