Figure 9.7 Using Hadoop on AWS with both S3 and HDFS (the diagram shows data flowing into the S3 cloud and from there into HDFS within the Hadoop MapReduce cluster in the EC2 cloud)
S3's cost model makes it an attractive storage service for many applications. In particular, it's well suited for use with Hadoop EC2 clusters.
You can see the dataflow model
in figure 9.7. The main change from the dataflow of
figure 9.6 is that your input data is first transferred to the S3 cloud instead of the master
node. Note that, unlike the master node, the S3 cloud storage persists independently of
your Hadoop EC2 cluster. You can create and terminate multiple Hadoop EC2 clusters
over time, and they can all read the same input data from S3. The benefit of this setup
is that you incur the monetary and time costs of copying your input data into AWS only once, when it's copied into S3, whereas in the dataflow of figure 9.6 those costs are incurred
on every session of the Hadoop EC2 cluster. After the input data is copied into S3,
copying it from the S3 cloud to the cluster's HDFS is fast and free, because both S3 and
EC2 are managed within the AWS system. There's now an additional monthly storage
cost for hosting your input data in S3, but it's usually minimal. If you also need scalable archival storage for your data, S3 can serve that role under this dataflow architecture, which further justifies its cost.
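As a rough sketch of that step, you can copy data from S3 into the cluster's HDFS with Hadoop's distcp tool, run from the cluster. The angle-bracket placeholders follow the same convention as the commands later in this section, and <hdfs-filepath> stands for a destination path in the cluster's HDFS, which is the default filesystem when the command runs on the cluster:

bin/hadoop distcp s3://<access-key-id>:<secret-access-key>@<s3-bucket>/<s3-filepath> <hdfs-filepath>

Because distcp executes the copy as a MapReduce job, the transfer is parallelized across the cluster's nodes.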
The default Hadoop installation has built-in support for using S3. There's a special
Hadoop filesystem for S3, called the S3 Block FileSystem, which is built on top of S3 to support large files. (S3 imposes a file size limit of 5 GB.) Treat the S3 Block FileSystem as a filesystem separate from S3 itself, just as HDFS is treated as distinct from the underlying Unix filesystem.
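As a minimal sketch, assuming you'd rather not embed your AWS credentials in every S3 URI, the S3 filesystem's credential properties can instead go into the Hadoop configuration of whichever machine issues the commands (hadoop-site.xml in older releases, core-site.xml in newer ones); the values below are placeholders for your own keys:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR-ACCESS-KEY-ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR-SECRET-ACCESS-KEY</value>
</property>

With these set, you can omit the credentials and refer to locations simply as s3://<s3-bucket>/<s3-filepath>.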
The S3 Block FileSystem requires a dedicated S3 bucket. Once you've created that
S3 bucket, you can move your data from the local machine to S3:
bin/hadoop fs -put <local-filepath>
s3://<access-key-id>:<secret-access-key>@<s3-bucket>/<s3-filepath>
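To verify the upload, you can list the bucket through the same filesystem interface, again substituting your own credentials and bucket name:

bin/hadoop fs -ls s3://<access-key-id>:<secret-access-key>@<s3-bucket>/

From there, a distcp command like the sketch shown earlier pulls the data into the running cluster's HDFS for processing.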