Understanding the Windows Azure Storage Blob
HDInsight introduces the Windows Azure Storage Blob (WASB) as the storage medium for Hadoop in the
cloud. Unlike a native Hadoop deployment, which stores its data in HDFS, the Windows Azure HDInsight service
uses WASB as the default storage for its Hadoop clusters. WASB persists the data in Azure blob storage
underneath. You can override the default and switch back to HDFS, but choosing WASB over HDFS has several advantages:
• WASB storage incorporates all the HDFS features, like fault tolerance, geo-replication, and
  partitioning.
• If you use WASB, you decouple the data nodes from the compute nodes. That is not possible with
  Hadoop and HDFS, where each node is both a data node and a compute node. This means
  that if you are not running large jobs, you can reduce the cluster's size and keep just the
  storage, probably at a reduced cost.
• You can spin up your Hadoop cluster only when needed and use it as a “transient
  compute cluster” instead of as permanent storage. You do not always want to keep
  idle compute clusters running just to store data. In most cases, it is more advantageous to create the
  compute resources on demand, process the data, and then de-allocate them without losing your
  data. You cannot do that with HDFS, but it is already done for you if you use WASB.
• You can spin up multiple Hadoop clusters that crunch the same set of data stored in a
  common blob location. In doing so, you essentially use Azure blob storage as a shared
  data store.
• Storage costs have been benchmarked at approximately five times lower for WASB than for
  HDFS.
• HDInsight has added significant enhancements to improve read/write performance when
  running MapReduce jobs on data from the Azure blob store.
• You can process data directly, without importing it into HDFS first. Many people already on
  a cloud infrastructure have existing pipelines, and those pipelines can push data directly
  to WASB.
• Azure blob storage is a useful place to store data across diverse services. In a typical case,
  HDInsight is one piece of a larger solution in Windows Azure. Azure blob storage can be the
  common link for unstructured blob data in such an environment.
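
To make the shared data store idea concrete, the following is a minimal Python sketch of how a blob location can be addressed by a single URI that any number of clusters could be pointed at. It assumes the commonly documented WASB address form, wasb[s]://<container>@<account>.blob.core.windows.net/<path>. The helper name wasb_uri and the container, account, and path values are hypothetical and chosen only for illustration.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a WASB URI: wasb[s]://<container>@<account>.blob.core.windows.net/<path>.

    Hypothetical helper for illustration; it only formats a string and
    does not contact Azure or check that the container exists.
    """
    if not container or not account:
        raise ValueError("container and account must be non-empty")
    # "wasbs" is the TLS-protected variant of the "wasb" scheme.
    scheme = "wasbs" if secure else "wasb"
    # Drop leading slashes so the result has exactly one "/" before the blob path.
    blob_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{blob_path}"


# Hypothetical names: every cluster that processes this data would use the same URI.
shared_input = wasb_uri("weblogs", "examplestorage", "raw/2014/01")
print(shared_input)
```

Because the location lives in the storage account rather than on any cluster node, the same URI stays valid after a cluster is de-allocated and can be handed to a cluster created later.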
Note: Most HDFS commands, such as ls, copyFromLocal, and mkdir, will still work as expected. Only the
commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fsck and
dfsadmin, will show different behavior on WASB.
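
As an illustration of the file system commands mentioned in the note, the Python sketch below assembles hadoop fs command lines for ls, mkdir, and copyFromLocal against a WASB location. It only builds and prints the commands; it does not run them. The helper name hadoop_fs_command, the container and account names, and the file paths are all hypothetical.

```python
import shlex
from typing import List

# Only the generic file system operations named in the note are allowed here.
ALLOWED_OPERATIONS = {"ls", "mkdir", "copyFromLocal"}


def hadoop_fs_command(operation: str, *args: str) -> List[str]:
    """Return the argument list for `hadoop fs -<operation> <args...>` without running it."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return ["hadoop", "fs", f"-{operation}", *args]


# Hypothetical container and account; substitute real values before use.
base = "wasb://mycontainer@myaccount.blob.core.windows.net"

commands = [
    hadoop_fs_command("mkdir", f"{base}/example/data"),
    hadoop_fs_command("copyFromLocal", "sample.log", f"{base}/example/data/sample.log"),
    hadoop_fs_command("ls", f"{base}/example/data"),
]

for cmd in commands:
    # shlex.join renders the argument list as a shell-safe command line.
    print(shlex.join(cmd))
```

Keeping each command as an argument list means it could later be passed to subprocess.run on a machine where the hadoop client is installed, without any shell quoting concerns.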
Figure 2-2 shows the architecture of an HDInsight service using WASB.
 
 