through a highly scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds; rather, it is about discovery, and about making the once nearly impossible possible from a scalability and analytics perspective.
Components of Hadoop and Related Projects
As mentioned earlier, Hadoop is generally seen as having two parts: a file
system (HDFS) and a programming paradigm (MapReduce). One of the key
components of Hadoop is the redundancy that is built into the environment.
Not only is data redundantly stored in multiple places across the cluster, but
the programming model is such that failures are expected and are resolved
automatically by running portions of the program on various servers in the
cluster. Because of this redundancy, it's possible to distribute the data and
programming across a very large cluster of commodity components, like the
cluster that we discussed earlier. It's well known that commodity hardware
components will fail (especially when you have very large numbers of them),
but this redundancy provides fault tolerance and the capability for the Hadoop
cluster to heal itself. This enables Hadoop to scale out workloads across large
clusters of inexpensive machines to work on Big Data problems.
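To make the MapReduce side of this concrete, here is a minimal sketch of the classic WordCount job, written against Hadoop's Java MapReduce API (it is essentially the standard Apache example; the input and output paths passed on the command line are placeholders). Notice that nothing in the code handles machine failure: as described above, the framework itself reruns any failed map or reduce task on another node in the cluster.

// WordCount: the "hello world" of Hadoop MapReduce.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across the cluster, one task per input
  // split; emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts for each word. If a map task dies,
  // the framework reschedules it elsewhere; the job keeps going.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., /user/demo/in
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g., /user/demo/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}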
There are many Hadoop-related projects, and some of the more notable ones include: Apache Avro (a data serialization framework), Cassandra and HBase (databases), Hive (ad hoc SQL-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level data-flow language and execution framework for parallel computation), and ZooKeeper (coordination services for distributed applications). We don't cover these related projects due to the size of the topic, but there's lots of information available on the Web, and of course, BigDataUniversity.com.
Hadoop 2.0
For as long as Hadoop has been a popular conversation topic around IT water coolers, the NameNode single point of failure (SPOF) has inevitably been brought up. Interestingly enough, for all the talk about this particular design limitation, there were actually very few documented NameNode failures, which is a testament to the resiliency of HDFS. But for mission-critical applications, even one failure is one too many, so having a hot standby for the NameNode is extremely important for wider enterprise adoption of Hadoop.
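From a client's point of view, the hot-standby design in Hadoop 2 works by addressing a logical nameservice rather than a single NameNode host; a failover proxy provider transparently retries against the standby if the active NameNode goes down. The sketch below sets the relevant properties programmatically just to stay self-contained (in production they live in hdfs-site.xml), and the host names and the "mycluster" nameservice ID are hypothetical.

// Hypothetical sketch: an HDFS client talking to an HA-enabled cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Clients name a logical nameservice, never a physical NameNode host.
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1",
        "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2",
        "namenode2.example.com:8020");
    // This provider is what makes NameNode failover transparent to clients.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha."
            + "ConfiguredFailoverProxyProvider");

    FileSystem fs = FileSystem.get(conf);
    // Resolves against whichever NameNode is currently active.
    System.out.println(fs.exists(new Path("/")));
  }
}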
 