Databases Reference
partially fill the gap. It implements a column-oriented data store modeled on Google's
BigTable on top of Hadoop and HDFS, and it also provides indexing for HDFS. With
HBase, it is possible to have multiple large tables, or even just one large table, distributed
across Hadoop.
There are a few areas where Hadoop, in its current form, scores well. An obvious one
is as an extract, transform, load (ETL) staging system when an organization has a flood of
data and only a small proportion can be put to use. The data can be stored in Hadoop, and
jobs can then be run to extract the useful data and load it into a database for deeper analysis.
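
The staging pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop code: the record layout, the names RAW_LINES, extract, run_staging_job, and load, and the use of an in-memory SQLite database as the target are all hypothetical stand-ins. In a real deployment the raw lines would sit in HDFS and the extraction would run as a distributed job.

```python
import sqlite3

# Hypothetical raw records standing in for files staged in Hadoop.
# Assumed layout: day,sensor,status,reading
RAW_LINES = [
    "2013-01-05,sensor-a,OK,21.5",
    "2013-01-05,sensor-b,ERROR,",
    "malformed line",
    "2013-01-06,sensor-a,OK,22.1",
]


def extract(line):
    """Return (day, sensor, reading) for a usable record, else None."""
    parts = line.split(",")
    if len(parts) != 4 or parts[2] != "OK":
        return None
    try:
        return parts[0], parts[1], float(parts[3])
    except ValueError:
        return None


def run_staging_job(lines):
    """Keep only the small proportion of records that can be put to use."""
    return [rec for rec in map(extract, lines) if rec is not None]


def load(records, conn):
    """Load the extracted records into a database table for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (day TEXT, sensor TEXT, reading REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(run_staging_job(RAW_LINES), conn)
```

The raw input is kept whole and only the filtered output reaches the database, which is the point of using Hadoop as a staging area: the extraction rules can be changed and the job rerun later without having lost any of the original data.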
Hadoop was built as a parallel processing environment for large data volumes,
not as a database. For that reason, it can be very useful if you need to manipulate data
in sophisticated ways. For example, it has been used both to render 3D video and for
scientific programming.
It is a massively parallel platform that can be used in many ways. Database
capabilities have been added, but even with these it is still best not to think of it as a
database product. Because Hadoop is open source, developers were free to try it, and
this drove its early popularity, as discussed in Chapter 4. Because it became popular,
many vendors began to exploit its capabilities, adding to it or linking it to their databases.
Hadoop has generated its own software ecosystem (Figure 5-5).
Figure 5-5. Hadoop conceptual framework
 