Chapter 1. Core Technologies
In 2002, when the World Wide Web was relatively new and before you “Googled” things,
Doug Cutting and Mike Cafarella wanted to crawl the Web and index the content so that they
could produce an Internet search engine. They began a project called Nutch to do this but
needed a scalable method to store the content of their indexing. The standard method to
organize and store data in 2002 was by means of relational database management systems
(RDBMS), which were accessed in a language called SQL. But almost all SQL and relational
stores were not appropriate for Internet search engine storage and retrieval. They were costly,
not terribly scalable, not as tolerant to failure as required, and possibly not as performant as
desired.
In 2003 and 2004, Google released two important papers, one on the Google File System [1]
and the other on a programming model for clustered servers called MapReduce [2]. Cutting and
Cafarella incorporated these technologies into their project, and eventually Hadoop was born.
Hadoop is not an acronym. Cutting's son had a yellow stuffed elephant he named Hadoop;
the name stuck to the project, and its icon is a cute little elephant. Yahoo!
began using Hadoop as the basis of its search engine, and soon its use spread to many other
organizations. Now Hadoop is the predominant big data platform. There are many resources
that describe Hadoop in great detail; here you will find a brief synopsis of many components
and pointers on where to learn more.
Hadoop consists of three primary resources:
▪ The Hadoop Distributed File System (HDFS)
▪ The MapReduce programming platform
▪ The Hadoop ecosystem, a collection of tools that use or sit beside MapReduce and HDFS
to store and organize data, and manage the machines that run Hadoop
Together, these machines form a cluster: a group of servers, almost always running some
variant of the Linux operating system, that work together to perform a task.
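To make the MapReduce programming model named above more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps in parallel across many machines and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two tiny "documents" stand in for crawled web pages.
    print(word_count(["the web is big", "the index is big"]))
```

Because each call to map_phase sees only one record and each call to reduce_phase sees only one key, the work can in principle be split across machines without the steps needing to coordinate, which is the property that makes the model suit a cluster.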
The Hadoop ecosystem consists of modules that help program the system, manage and
configure the cluster, manage data in the cluster, manage storage in the cluster, perform
analytic tasks, and the like. Most of the modules in this topic describe the components of
the ecosystem and related technologies.