Figure 9.7 Using Hadoop on AWS with both S3 and HDFS (the diagram shows data flowing into the S3 cloud and from there into HDFS within the Hadoop MapReduce cluster in the EC2 cloud)
S3's cost model makes it an attractive storage service for many applications. In particular, it's well suited for use with Hadoop EC2 clusters.
You can see the dataflow model
in figure 9.7. The main change from the dataflow of
figure 9.6 is that your input data is first transferred to the S3 cloud instead of the master
node. Note that, unlike the master node, the S3 cloud storage persists independently of
your Hadoop EC2 cluster. You can create and terminate multiple Hadoop EC2 clusters
over time, and they can all read the same input data from S3. The benefit of this setup
is that you incur the monetary and time costs of copying your input data into AWS only once, when it's copied into S3, whereas in the dataflow of figure 9.6 those costs are incurred
on every session of the Hadoop EC2 cluster. After the input data is copied into S3,
copying it from the S3 cloud to the cluster's HDFS is fast and free, because both S3 and
EC2 are managed within the AWS system. There's now an additional monthly storage
cost for hosting your input data in S3, but it's usually minimal. If you also need scalable archival storage for your data, S3 can serve that role under this dataflow architecture, which further justifies its cost.
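As a rough sketch of that step, you can copy data from S3 into the cluster's HDFS with Hadoop's distcp tool, run from the cluster. The angle-bracket placeholders follow the same convention as the commands later in this section, and <hdfs-filepath> stands for a destination path in the cluster's HDFS, which is the default filesystem when the command runs on the cluster:

bin/hadoop distcp s3://<access-key-id>:<secret-access-key>@<s3-bucket>/<s3-filepath> <hdfs-filepath>

Because distcp executes the copy as a MapReduce job, the transfer is parallelized across the cluster's nodes.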
The default Hadoop installation has built-in support for using S3. There's a special
Hadoop filesystem for S3, called the S3 Block FileSystem, which is built on top of S3 to support large files. (S3 imposes a file size limit of 5 GB.) Treat the S3 Block FileSystem as a filesystem separate from S3 itself, just as HDFS is treated as distinct from the underlying Unix filesystem.
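As a minimal sketch, assuming you'd rather not embed your AWS credentials in every S3 URI, the S3 filesystem's credential properties can instead go into the Hadoop configuration of whichever machine issues the commands (hadoop-site.xml in older releases, core-site.xml in newer ones); the values below are placeholders for your own keys:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR-ACCESS-KEY-ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR-SECRET-ACCESS-KEY</value>
</property>

With these set, you can omit the credentials and refer to locations simply as s3://<s3-bucket>/<s3-filepath>.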
The S3 Block FileSystem requires a dedicated S3 bucket. Once you've created that
S3 bucket, you can move your data from the local machine to S3:
bin/hadoop fs -put <local-filepath>
s3://<access-key-id>:<secret-access-key>@<s3-bucket>/<s3-filepath>
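To verify the upload, you can list the bucket through the same filesystem interface, again substituting your own credentials and bucket name:

bin/hadoop fs -ls s3://<access-key-id>:<secret-access-key>@<s3-bucket>/

From there, a distcp command like the sketch shown earlier pulls the data into the running cluster's HDFS for processing.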