at reasonable cost) as the data grows and one that will tolerate system failure. Processing all this data may take
thousands of servers, so these systems must be affordable enough to keep the cost per unit of storage reasonable.
In licensing terms, the software must also be affordable, because it will need to be installed on thousands of servers.
Further, the system must offer redundancy in terms of both data storage and hardware used. It must also operate on
commodity hardware, such as generic, low-cost servers, which helps to keep costs down. It must additionally be able
to scale to a very high degree because the data set will start large and will continue to grow. Finally, a system like this
should take the processing to the data, rather than expect the data to come to the processing; otherwise, networks
would quickly run out of bandwidth, as the rough calculation below suggests.
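To put rough numbers on that claim, here is a back-of-envelope sketch in Python comparing the two approaches. Every figure in it (a 1 PB data set, a shared 10 Gbit/s network link, 100 MB/s local disk reads, 1,000 servers) is an illustrative assumption of mine, not a number from this chapter; only the orders of magnitude matter.

# Back-of-envelope comparison: ship the data to the processing, or
# ship the processing to the data. All figures below are assumptions.

PETABYTE_BYTES = 10**15      # assumed data set size: 1 PB
NETWORK_GBPS = 10            # assumed shared network link: 10 Gbit/s
DISK_MBPS = 100              # assumed per-server local disk read: 100 MB/s
SERVERS = 1000               # assumed cluster size

# Moving the data to the processing is limited by the network link.
network_bytes_per_sec = NETWORK_GBPS * 10**9 / 8
network_hours = PETABYTE_BYTES / network_bytes_per_sec / 3600

# Moving the processing to the data lets every server read its local share.
local_bytes_per_sec = DISK_MBPS * 10**6 * SERVERS
local_hours = PETABYTE_BYTES / local_bytes_per_sec / 3600

print(f"Over the network:   {network_hours:,.0f} hours")  # about 222 hours
print(f"Local, in parallel: {local_hours:.1f} hours")     # about 2.8 hours

The exact numbers are unimportant; the point is that a single shared link is roughly two orders of magnitude slower than a thousand local disks working in parallel.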
Requirements for a Big Data System
This idea of a big data system requires a tool set that is rich in functionality. For example, it needs a unique kind of
distributed storage platform that is able to move very large data volumes into the system without losing data. The
tools must include some kind of configuration system to keep all of the system's servers coordinated, as well as ways
of finding data and streaming it into the system in some type of ETL-based stream. (ETL, or extract, transform, load,
is the classic data warehouse processing sequence; a minimal sketch of it follows this paragraph.) Software also needs
to monitor the system and to provide downstream
destination systems with data feeds so that management can view trends and issue reports based on the data. While
this big data system may take hours to move an individual record, process it, and store it on a server, it also needs to
monitor trends in real time.
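As a minimal, single-machine sketch of that extract, transform, load sequence, the Python below streams records through the three stages using generators, so each record flows through without the whole data set being held in memory. The sample feed, the field names, and the in-memory "warehouse" list are hypothetical stand-ins of mine; a real big data system would use distributed ingestion and storage tools instead.

import csv, io

# Hypothetical raw feed; a real system would stream this from many sources.
RAW = "user,amount\n Alice ,10.50\nBOB,3.25\n"

def extract(source):
    """Extract: stream raw records from the source, one at a time."""
    yield from csv.DictReader(source)

def transform(records):
    """Transform: clean and reshape each record for the warehouse."""
    for rec in records:
        yield {"user": rec["user"].strip().lower(),
               "amount": float(rec["amount"])}

def load(records, sink):
    """Load: write the cleaned records to the destination store."""
    for rec in records:
        sink.append(rec)          # stand-in for a distributed store

warehouse = []
load(transform(extract(io.StringIO(RAW))), warehouse)
print(warehouse)  # [{'user': 'alice', 'amount': 10.5}, {'user': 'bob', 'amount': 3.25}]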
In summary, to manipulate big data, a system requires the following:

• A method of collecting and categorizing data
• A method of moving data into the system safely and without data loss
• A storage system that
  • Is distributed across many servers
  • Is scalable to thousands of servers
  • Offers data redundancy and backup
  • Offers redundancy in case of hardware failure
  • Is cost-effective
• A rich tool set and community support
• A method of distributed system configuration
• Parallel data processing (see the sketch after this list)
• System-monitoring tools
• Reporting tools
• ETL-like tools (preferably with a graphical interface) that can be used to build tasks that process the data and monitor their progress
• Scheduling tools to determine when tasks will run and to show task status
• The ability to monitor data trends in real time
• Local processing where the data is stored, to reduce network bandwidth usage
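To make the parallel-processing and local-processing requirements concrete, here is a minimal single-machine analogy in Python. The data is pre-split into chunks, standing in for the blocks a distributed file system would store on separate servers; each worker process counts words in its own chunk (the map step), and the partial counts are merged at the end (the reduce step). The sample data and function names are mine, not from the text.

from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    """Map step: count words in one locally held chunk of the data."""
    return Counter(chunk.split())

if __name__ == "__main__":
    # Hypothetical data set, pre-split into chunks the way a distributed
    # file system would store blocks on different servers.
    chunks = [
        "big data needs distributed storage",
        "distributed storage needs redundancy",
        "redundancy protects against hardware failure",
    ]
    with Pool() as pool:
        partial_counts = pool.map(map_count, chunks)  # parallel map
    # Reduce step: merge the per-chunk counts into one result.
    totals = sum(partial_counts, Counter())
    print(totals.most_common(3))

On a real cluster, each map task would run on the server that already holds its block of data, which is exactly the "take the processing to the data" principle described earlier.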
Later in this chapter I explain how the topic is organized with these requirements in mind. But first, let's consider
which tools best meet the big data requirements listed above.
 