at reasonable cost) as the data grows and one that will tolerate system failure. Processing all this data may take
thousands of servers, so these systems must be affordable enough to keep the cost per unit of storage reasonable.
In licensing terms, the software must also be affordable, because it will need to be installed on thousands of servers.
Further, the system must offer redundancy in terms of both data storage and hardware used. It must also operate on
commodity hardware, such as generic, low-cost servers, which helps to keep costs down. It must additionally be able
to scale to a very high degree because the data set will start large and will continue to grow. Finally, a system like this
should take the processing to the data, rather than expect the data to come to the processing; otherwise, networks
would quickly run out of bandwidth, as the rough calculation below suggests.
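To put rough numbers on that claim, here is a back-of-envelope sketch in Python comparing the two approaches. Every figure in it (a 1 PB data set, a shared 10 Gbit/s network link, 100 MB/s local disk reads, 1,000 servers) is an illustrative assumption of mine, not a number from this chapter; only the orders of magnitude matter.

# Back-of-envelope comparison: ship the data to the processing, or
# ship the processing to the data. All figures below are assumptions.

PETABYTE_BYTES = 10**15      # assumed data set size: 1 PB
NETWORK_GBPS = 10            # assumed shared network link: 10 Gbit/s
DISK_MBPS = 100              # assumed per-server local disk read: 100 MB/s
SERVERS = 1000               # assumed cluster size

# Moving the data to the processing is limited by the network link.
network_bytes_per_sec = NETWORK_GBPS * 10**9 / 8
network_hours = PETABYTE_BYTES / network_bytes_per_sec / 3600

# Moving the processing to the data lets every server read its local share.
local_bytes_per_sec = DISK_MBPS * 10**6 * SERVERS
local_hours = PETABYTE_BYTES / local_bytes_per_sec / 3600

print(f"Over the network:   {network_hours:,.0f} hours")  # about 222 hours
print(f"Local, in parallel: {local_hours:.1f} hours")     # about 2.8 hours

The exact numbers are unimportant; the point is that a single shared link is roughly two orders of magnitude slower than a thousand local disks working in parallel.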
Requirements for a Big Data System
This idea of a big data system requires a tool set that is rich in functionality. For example, it needs a unique kind of
distributed storage platform that is able to move very large data volumes into the system without losing data. The
tools must include some kind of configuration system to keep all of the system's servers coordinated, as well as ways
of finding data and streaming it into the system in some type of ETL-based stream. (ETL, or extract, transform, load,
is the classic data warehouse processing sequence; a minimal sketch of it follows this paragraph.) Software also needs
to monitor the system and to provide downstream
destination systems with data feeds so that management can view trends and issue reports based on the data. While
this big data system may take hours to move an individual record, process it, and store it on a server, it also needs to
monitor trends in real time.
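As a minimal, single-machine sketch of that extract, transform, load sequence, the Python below streams records through the three stages using generators, so each record flows through without the whole data set being held in memory. The sample feed, the field names, and the in-memory "warehouse" list are hypothetical stand-ins of mine; a real big data system would use distributed ingestion and storage tools instead.

import csv, io

# Hypothetical raw feed; a real system would stream this from many sources.
RAW = "user,amount\n Alice ,10.50\nBOB,3.25\n"

def extract(source):
    """Extract: stream raw records from the source, one at a time."""
    yield from csv.DictReader(source)

def transform(records):
    """Transform: clean and reshape each record for the warehouse."""
    for rec in records:
        yield {"user": rec["user"].strip().lower(),
               "amount": float(rec["amount"])}

def load(records, sink):
    """Load: write the cleaned records to the destination store."""
    for rec in records:
        sink.append(rec)          # stand-in for a distributed store

warehouse = []
load(transform(extract(io.StringIO(RAW))), warehouse)
print(warehouse)  # [{'user': 'alice', 'amount': 10.5}, {'user': 'bob', 'amount': 3.25}]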
In summary, to manipulate big data, a system requires the following:

• A method of collecting and categorizing data
• A method of moving data into the system safely and without data loss
• A storage system that
  • Is distributed across many servers
  • Is scalable to thousands of servers
  • Offers data redundancy and backup
  • Offers redundancy in case of hardware failure
  • Is cost-effective
• A rich tool set and community support
• A method of distributed system configuration
• Parallel data processing (see the sketch after this list)
• System-monitoring tools
• Reporting tools
• ETL-like tools (preferably with a graphical interface) that can be used to build tasks that process the data and monitor their progress
• Scheduling tools to determine when tasks will run and to show task status
• The ability to monitor data trends in real time
• Local processing where the data is stored, to reduce network bandwidth usage
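To make the parallel-processing and local-processing requirements concrete, here is a minimal single-machine analogy in Python. The data is pre-split into chunks, standing in for the blocks a distributed file system would store on separate servers; each worker process counts words in its own chunk (the map step), and the partial counts are merged at the end (the reduce step). The sample data and function names are mine, not from the text.

from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    """Map step: count words in one locally held chunk of the data."""
    return Counter(chunk.split())

if __name__ == "__main__":
    # Hypothetical data set, pre-split into chunks the way a distributed
    # file system would store blocks on different servers.
    chunks = [
        "big data needs distributed storage",
        "distributed storage needs redundancy",
        "redundancy protects against hardware failure",
    ]
    with Pool() as pool:
        partial_counts = pool.map(map_count, chunks)  # parallel map
    # Reduce step: merge the per-chunk counts into one result.
    totals = sum(partial_counts, Counter())
    print(totals.most_common(3))

On a real cluster, each map task would run on the server that already holds its block of data, which is exactly the "take the processing to the data" principle described earlier.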
Later in this chapter I explain how the topic is organized with these requirements in mind. But first, let's consider
which tools best meet the big data requirements listed above.
 