What's in This Topic
This topic is organized around the particular features of a big data system, paralleling the general requirements
listed at the beginning of this chapter. This first chapter describes the features of big data
and names the related tools that are introduced in the chapters that follow. My aim here is to describe as many big
data tools as possible, using practical examples. (Keep in mind, however, that writing deadlines and software update
schedules don't always mesh, so some tools or functions may have changed by the time you read this.)
All of the tools discussed in this topic have been chosen because they are supported by a large user base, which
fulfills big data's general requirements of a rich tool set and community support. Each Apache Hadoop-based tool has
its own website and often its own help forum. The ETL and reporting tools introduced in Chapters 10 and 11, although
non-Hadoop, are also supported by their own communities.
Storage: Chapter 2
Storage, discussed in Chapter 2, accounts for the greatest number of the big data requirements listed earlier:
A storage system that
Is distributed across many servers
Is scalable to thousands of servers
Will offer data redundancy and backup
Will offer redundancy in case of hardware failure
Will be cost-effective
As a highly scalable, distributed storage system, Hadoop meets all of these requirements. It offers a high
level of redundancy, with data blocks copied across the cluster. It is fault tolerant, having been designed with
hardware failure in mind. It also offers a low cost per unit of storage. Chapter 2 installs and examines Hadoop
versions 1.x and 2.x, along with a method of distributed system configuration: the Apache ZooKeeper system,
which is used within the Hadoop ecosystem to provide a distributed configuration service for Apache Hadoop tools.
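To make the redundancy point concrete, the short Java sketch below uses the Hadoop FileSystem API to report the replication factor of a stored file. It is a minimal illustration rather than code from Chapter 2; the file path /data/sample.txt is an assumed example, and the cluster address is taken from the local Hadoop configuration files.

// Minimal sketch (an assumption, not the book's code): report the HDFS
// replication factor of a file using the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured HDFS
        // /data/sample.txt is an illustrative path, not one used in the book
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
    }
}

On a default installation this typically prints 3, the standard HDFS replication factor, reflecting the block copying described above.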
Data Collection: Chapter 3
Automated web crawling is a widely used way of collecting data, so a method of collecting and categorizing
that data is needed. Chapter 3 describes two architectures that use Nutch and Solr to search the web and store data.
The first stores data directly to HDFS, while the second uses Apache HBase. The chapter provides examples of both.
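As a rough illustration of the search side of such an architecture, the Java sketch below uses the SolrJ client to query an index populated by a crawl. It is an assumed example rather than the book's code: the Solr URL, the core name nutch, and the field names title and url all depend on how the index was actually configured.

// Hypothetical SolrJ sketch: query a Solr core that holds crawled pages.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CrawlSearch {
    public static void main(String[] args) throws Exception {
        // core name "nutch" is an assumption for the example
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();
        SolrQuery query = new SolrQuery("title:hadoop");  // search crawled page titles
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("url")); // field name is an assumption
        }
        solr.close();
    }
}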
Processing: Chapter 4
The following big data requirements relate to data processing:
Parallel data processing
Local processing where the data is stored to reduce network bandwidth usage
Chapter 4 introduces a variety of MapReduce programming approaches, with examples. MapReduce programs
are developed in Java, Apache Pig, Perl, and Apache Hive.
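To give a flavor of the Java approach, the condensed word-count sketch below is the classic MapReduce example, written against the Hadoop 2 MapReduce API: the mapper emits each word with a count of one, and the reducer sums the counts. It is a generic illustration, not the listing from Chapter 4.

// Classic word-count sketch (generic illustration, not the book's Chapter 4 code).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit each word with a count of 1
                }
            }
        }
    }
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();   // total the counts per word
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a JAR and submitted with the hadoop jar command, reading its input from and writing its output to HDFS directories supplied as arguments, so the processing runs in parallel where the data blocks are stored.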
 