provide grouped services. Chukwa is a data collection system for monitoring distributed systems; it can display, monitor, and analyze the collected data. Sqoop allows data to be conveniently transferred between structured data stores, such as relational databases, and Hadoop. Mahout is a data mining library executed on Hadoop using MapReduce; it includes core algorithms for collaborative filtering, clustering, and classification, and is based on batch processing.
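To make the Mahout description more concrete, the following is a minimal sketch of a user-based collaborative-filtering recommender built with Mahout's Taste API (the in-memory, single-machine variant rather than the MapReduce jobs mentioned above); the data file name, neighborhood size, and user ID are illustrative assumptions, not values from the text.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of "userID,itemID,preference" (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when predicting preferences.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 5 items for user 42 (illustrative user ID).
        List<RecommendedItem> recommendations = recommender.recommend(42, 5);
        for (RecommendedItem item : recommendations) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }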
Benefiting from the huge success of Google's distributed file system and the MapReduce computational model for processing massive data, their open-source counterpart, Hadoop, has attracted more and more attention. Hadoop is closely related to big data, as nearly all leading big data enterprises offer commercial big data solutions based on Hadoop. Hadoop is becoming the cornerstone of big data. Apache Hadoop is an open-source software framework that realizes the distributed processing of massive data on large-scale clusters of commodity servers, rather than relying on expensive proprietary hardware and specialized systems to store and process data.
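To give a concrete feel for the MapReduce model that Hadoop implements, the following is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API; the input and output paths are supplied on the command line and are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally to reduce shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map tasks run in parallel on the nodes holding the input blocks, and the framework shuffles intermediate (word, count) pairs to the reduce tasks, so the same program scales from one machine to thousands without modification.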
Hadoop has many advantages, but the following aspects are especially relevant to the management and analysis of big data:

• Expandability: Hadoop allows hardware infrastructure to be expanded or shrunk without changing data formats. The system automatically re-distributes data, and computing tasks adapt to the hardware changes.

• High Cost Efficiency: Hadoop brings large-scale parallel computing to commodity servers, which greatly reduces the cost per TB of storage capacity. The large-scale computing also enables it to accommodate continually growing data volumes.

• Strong Flexibility: Hadoop can handle many kinds of data from various sources. In addition, data from many sources can be synthesized in Hadoop for further analysis. Therefore, it can cope with many kinds of challenges brought by big data.

• High Fault-Tolerance: data loss and miscalculation are common during the analysis of big data, but Hadoop can recover data and correct computing errors caused by node failures or network congestion, as illustrated by the replication sketch after this list.
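As a concrete illustration of the fault-tolerance point, the sketch below writes a local file into HDFS with a block replication factor of 3, so every block survives the loss of up to two DataNodes; the file paths and the replication factor are illustrative assumptions, not values from the text.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative setting: keep 3 copies of every block so that the loss
        // of a node does not lose data; clusters commonly set this in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths used only for this example.
        fs.copyFromLocalFile(new Path("events.log"), new Path("/data/events.log"));
        fs.close();
      }
    }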
2.4.2 Relationship between Hadoop and Big Data
Presently, Hadoop is widely used in big data applications in industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As announced in June 2012, Yahoo runs Hadoop on 42,000 servers in four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that its Hadoop cluster could process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some