through a highly scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds; rather, it is about discovery, and about making the once nearly impossible possible from a scalability and analytics perspective.
Components of Hadoop and Related Projects
As mentioned earlier, Hadoop is generally seen as having two parts: a file
system (HDFS) and a programming paradigm (MapReduce). One of the key
components of Hadoop is the redundancy that is built into the environment.
Not only is data redundantly stored in multiple places across the cluster, but
the programming model is such that failures are expected and are resolved
automatically by running portions of the program on various servers in the
cluster. Because of this redundancy, it's possible to distribute the data and
programming across a very large cluster of commodity components, like the
cluster that we discussed earlier. It's well known that commodity hardware
components will fail (especially when you have very large numbers of them),
but this redundancy provides fault tolerance and the capability for the Hadoop
cluster to heal itself. This enables Hadoop to scale out workloads across large
clusters of inexpensive machines to work on Big Data problems.
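To make the MapReduce side of this concrete, here is a minimal sketch of the classic WordCount job, written against Hadoop's Java MapReduce API (it is essentially the standard Apache example; the input and output paths passed on the command line are placeholders). Notice that nothing in the code handles machine failure: as described above, the framework itself reruns any failed map or reduce task on another node in the cluster.

// WordCount: the "hello world" of Hadoop MapReduce.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across the cluster, one task per input
  // split; emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts for each word. If a map task dies,
  // the framework reschedules it elsewhere; the job keeps going.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., /user/demo/in
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g., /user/demo/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}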
There are many Hadoop-related projects, and some of the more notable ones include: Apache Avro (a data serialization framework), Cassandra and HBase (databases), Hive (ad hoc SQL-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level data-flow language and execution framework for parallel computation), and ZooKeeper (coordination services for distributed applications). We don't cover these related projects due to the size of the topic, but there's lots of information available on the Web, and of course, BigDataUniversity.com.
Hadoop 2.0
For as long as Hadoop has been a popular conversation topic around IT water coolers, the NameNode single point of failure (SPOF) has inevitably been brought up. Interestingly enough, for all the talk about this particular design limitation, there were actually very few documented NameNode failures, which is a testament to the resiliency of HDFS. But for mission-critical applications, even one failure is one too many, so having a hot standby for the NameNode is extremely important for wider enterprise adoption of Hadoop.
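From a client's point of view, the hot-standby design in Hadoop 2 works by addressing a logical nameservice rather than a single NameNode host; a failover proxy provider transparently retries against the standby if the active NameNode goes down. The sketch below sets the relevant properties programmatically just to stay self-contained (in production they live in hdfs-site.xml), and the host names and the "mycluster" nameservice ID are hypothetical.

// Hypothetical sketch: an HDFS client talking to an HA-enabled cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Clients name a logical nameservice, never a physical NameNode host.
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1",
        "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2",
        "namenode2.example.com:8020");
    // This provider is what makes NameNode failover transparent to clients.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha."
            + "ConfiguredFailoverProxyProvider");

    FileSystem fs = FileSystem.get(conf);
    // Resolves against whichever NameNode is currently active.
    System.out.println(fs.exists(new Path("/")));
  }
}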
 