Setting Up a Big Data Sandbox in the Cloud
Before you dig in and get your hands dirty, both Amazon and Microsoft
require that you register or create an account as a condition of service.
Amazon will let you create an account using an existing Amazon account,
and Microsoft accounts are based on Live IDs.
If you have neither, creating these accounts requires typical information
such as name, contact information, an agreement to terms of service, and
verification of identity. In addition, both require a credit card for payment
and billing purposes.
As mentioned previously, these services are billed on the basis of either
compute hours or storage space. Both providers offer a free or preview
version that lets you try them with little, if any, out-of-pocket cost,
but the payment method you provide will be billed for any usage beyond
the free allowance. The tutorials in this chapter assume that you
have created accounts for both Amazon and Microsoft.
Getting Started with Amazon EMR
Amazon EMR provides a simple, straightforward method for processing and
analyzing large volumes of data by distributing the computation across a
virtual cluster of Hadoop servers. The premise is simple and starts with
uploading data to a bucket in an Amazon S3 account.
Next, a job is created and uploaded to your S3 account. This job will be
responsible for processing your data and can take a number of different
forms, such as Hive, Pig, or HBase jobs, custom JARs, or even a streaming
script. (Currently supported languages include Ruby, Perl, Python, PHP, R,
Bash, and C++.) Your cluster will read the job and data from S3, process
your data using a Hadoop cluster running on EC2, and return the output to
S3.
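
To make the idea of a streaming script concrete, here is a minimal sketch in Python of the kind of mapper and reducer a streaming job might contain. The word-count task and the function names (run_mapper, run_reducer, simulate) are assumptions chosen for illustration, not something this chapter prescribes. Streaming scripts exchange tab-separated key/value lines over standard input and output, and Hadoop sorts the mapper output by key before it reaches the reducer; simulate() imitates that step locally so the logic can be checked before anything is uploaded to S3.

```python
import io
from itertools import groupby


def run_mapper(lines, out):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            out.write(f"{word}\t1\n")


def run_reducer(lines, out):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")


def simulate(text):
    """Local stand-in for the cluster (an assumption): map, sort, reduce."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    # Hadoop performs this sort between the map and reduce phases.
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    run_reducer(shuffled, reduced)
    return reduced.getvalue()
```

In an actual streaming job, the mapper and reducer would typically live in separate script files, each calling its function with sys.stdin and sys.stdout, and both files would be uploaded to your S3 bucket alongside the input data.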