provide grouped services. Chukwa is a data collection system for monitoring distributed systems; it can display, monitor, and analyze the collected data. Sqoop allows data to be conveniently transferred between structured data stores, such as relational databases, and Hadoop. Mahout is a data mining library executed on Hadoop using MapReduce; it includes core algorithms for collaborative filtering, clustering, and classification, and is based on batch processing.
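To make the Mahout description more concrete, the following is a minimal sketch of a user-based collaborative-filtering recommender built with Mahout's Taste API (the in-memory, single-machine variant rather than the MapReduce jobs mentioned above); the data file name, neighborhood size, and user ID are illustrative assumptions, not values from the text.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of "userID,itemID,preference" (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when predicting preferences.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 5 items for user 42 (illustrative user ID).
        List<RecommendedItem> recommendations = recommender.recommend(42, 5);
        for (RecommendedItem item : recommendations) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }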
Benefiting from the huge success of Google's distributed file system and the MapReduce computational model for processing massive data, their open-source counterpart, Hadoop, has attracted more and more attention. Hadoop is closely related to big data, as nearly all leading big data enterprises offer commercial big data solutions based on Hadoop. Hadoop is becoming the cornerstone of big data. Apache Hadoop is an open-source software framework that realizes the distributed processing of massive data on large-scale clusters of commodity servers, rather than relying on expensive proprietary hardware and specialized systems to store and process data.
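To give a concrete feel for the MapReduce model that Hadoop implements, the following is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API; the input and output paths are supplied on the command line and are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally to reduce shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map tasks run in parallel on the nodes holding the input blocks, and the framework shuffles intermediate (word, count) pairs to the reduce tasks, so the same program scales from one machine to thousands without modification.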
Hadoop has many advantages, but the following aspects are especially relevant to the management and analysis of big data:

• Expandability: Hadoop allows hardware infrastructure to be expanded or shrunk without changing data formats. The system automatically re-distributes data, and computing tasks adapt to the hardware changes.

• High Cost Efficiency: Hadoop brings large-scale parallel computing to commodity servers, which greatly reduces the cost per TB of storage capacity. The large-scale computing also enables it to accommodate continually growing data volumes.

• Strong Flexibility: Hadoop can handle many kinds of data from various sources. In addition, data from many sources can be synthesized in Hadoop for further analysis. Therefore, it can cope with many kinds of challenges brought by big data.

• High Fault-Tolerance: data loss and miscalculation are common during the analysis of big data, but Hadoop can recover data and correct computing errors caused by node failures or network congestion, as illustrated by the replication sketch after this list.
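As a concrete illustration of the fault-tolerance point, the sketch below writes a local file into HDFS with a block replication factor of 3, so every block survives the loss of up to two DataNodes; the file paths and the replication factor are illustrative assumptions, not values from the text.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative setting: keep 3 copies of every block so that the loss
        // of a node does not lose data; clusters commonly set this in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths used only for this example.
        fs.copyFromLocalFile(new Path("events.log"), new Path("/data/events.log"));
        fs.close();
      }
    }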
2.4.2 Relationship between Hadoop and Big Data
Presently, Hadoop is widely used in big data applications in industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As announced in June 2012, Yahoo runs Hadoop on 42,000 servers in four data centers to support its products and services, e.g., searching and spam filtering. At present, its biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that its Hadoop cluster could process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some