What's in This Topic
This topic is organized around the particular features of a big data system, paralleling the general requirements
listed at the beginning of this chapter. This first chapter describes the features of big data
and names the related tools that are introduced in the chapters that follow. My aim here is to describe as many big
data tools as possible, using practical examples. (Keep in mind, however, that writing deadlines and software update
schedules don't always mesh, so some tools or functions may have changed by the time you read this.)
All of the tools discussed in this topic have been chosen because they are supported by a large user base, which
fulfills big data's general requirements of a rich tool set and community support. Each Apache Hadoop-based tool has
its own website and often its own help forum. The ETL and reporting tools introduced in Chapters 10 and 11, although
non-Hadoop, are also supported by their own communities.
Storage: Chapter 2
Storage, discussed in Chapter 2, accounts for the greatest number of the big data requirements listed earlier:
A storage system that
Is distributed across many servers
Is scalable to thousands of servers
Will offer data redundancy and backup
Will offer redundancy in case of hardware failure
Will be cost-effective
As a highly scalable, distributed storage system, Hadoop meets all of these requirements. It offers a high
level of redundancy, with data blocks copied across the cluster. It is fault tolerant, having been designed with
hardware failure in mind. It also offers a low cost per unit of storage. Chapter 2 installs and examines Hadoop
versions 1.x and 2.x, along with a method of distributed system configuration: the Apache ZooKeeper system,
which is used within the Hadoop ecosystem to provide a distributed configuration service for Apache Hadoop tools.
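To make the redundancy point concrete, the short Java sketch below uses the Hadoop FileSystem API to report the replication factor of a stored file. It is a minimal illustration rather than code from Chapter 2; the file path /data/sample.txt is an assumed example, and the cluster address is taken from the local Hadoop configuration files.

// Minimal sketch (an assumption, not the book's code): report the HDFS
// replication factor of a file using the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured HDFS
        // /data/sample.txt is an illustrative path, not one used in the book
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
    }
}

On a default installation this typically prints 3, the standard HDFS replication factor, reflecting the block copying described above.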
Data Collection: Chapter 3
Automated web crawling is a widely used way of collecting data, so a method of collecting and categorizing
that data is needed. Chapter 3 describes two architectures that use Nutch and Solr to search the web and store data.
The first stores data directly to HDFS, while the second uses Apache HBase. The chapter provides examples of both.
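As a rough illustration of the search side of such an architecture, the Java sketch below uses the SolrJ client to query an index populated by a crawl. It is an assumed example rather than the book's code: the Solr URL, the core name nutch, and the field names title and url all depend on how the index was actually configured.

// Hypothetical SolrJ sketch: query a Solr core that holds crawled pages.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CrawlSearch {
    public static void main(String[] args) throws Exception {
        // core name "nutch" is an assumption for the example
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();
        SolrQuery query = new SolrQuery("title:hadoop");  // search crawled page titles
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("url")); // field name is an assumption
        }
        solr.close();
    }
}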
Processing: Chapter 4
The following big data requirements relate to data processing:
Parallel data processing
Local processing where the data is stored to reduce network bandwidth usage
Chapter 4 introduces a variety of MapReduce programming approaches, with examples. MapReduce programs
are developed in Java, Apache Pig, Perl, and Apache Hive.
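To give a flavor of the Java approach, the condensed word-count sketch below is the classic MapReduce example, written against the Hadoop 2 MapReduce API: the mapper emits each word with a count of one, and the reducer sums the counts. It is a generic illustration, not the listing from Chapter 4.

// Classic word-count sketch (generic illustration, not the book's Chapter 4 code).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit each word with a count of 1
                }
            }
        }
    }
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();   // total the counts per word
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a JAR and submitted with the hadoop jar command, reading its input from and writing its output to HDFS directories supplied as arguments, so the processing runs in parallel where the data blocks are stored.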
 