of data. Here, we examine several such applications. Although they belong to different
scientific fields, these applications place similar and increasing demands on data
analysis. The first example comes from computational biology. GenBank is a nucleotide
sequence database maintained by the U.S. National Center for Biotechnology
Information. Data in this database may double every 10 months. By August 2009,
GenBank held more than 250 billion bases from 150,000 different organisms [6].
The second example comes from astronomy. The Sloan Digital Sky Survey (SDSS), the
biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008.
As the resolution of the telescope is improved, by 2004, the data volume generated
per night will surpass 20 TB. The last application comes from high-energy physics.
At the beginning of 2008, the ATLAS experiment at the Large Hadron Collider (LHC) of
the European Organization for Nuclear Research generated raw data at 2 PB/s and stored
about 10 TB of processed data per year.
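The doubling figure quoted for GenBank implies exponential growth, which is easy to underestimate. The following is a minimal sketch of that arithmetic, assuming a steady doubling period; the function name `projected_size` and its parameters are illustrative and do not come from the source.

```python
def projected_size(initial: float, months: float, doubling_months: float = 10.0) -> float:
    """Return the size after `months`, assuming it doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


# Growth factor over one year if the volume doubles every 10 months.
yearly_factor = projected_size(1.0, 12)
print(f"{yearly_factor:.2f}x per year")
```

Under this assumption, a database more than doubles each year, so storage and analysis capacity planned on a linear trend would fall behind quickly.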
In addition, pervasive sensing and computing across natural, commercial, Internet,
government, and social environments are generating heterogeneous data of
unprecedented complexity. These datasets have distinctive characteristics in
scale, time dimension, and data category. For example, mobile data are recorded
with respect to position, movement, proximity, communications,
multimedia, application usage, and audio environment. Depending on the application
environment and requirements, such datasets can be classified into different
categories so that proper and feasible big data solutions can be selected.
3.2 Big Data Acquisition
As the second phase of the big data system, big data acquisition includes data
collection, data transmission, and data pre-processing. Once the raw data is
collected, an efficient transmission mechanism should be used to send it to a
proper storage management system that supports different analytical
applications. The collected datasets may include a large amount of redundant or
useless data, which unnecessarily increases storage space and degrades subsequent
data analysis. For example, high redundancy is very common in datasets
collected by sensors for environmental monitoring, and data compression techniques
can be applied to reduce it. Data pre-processing operations are therefore
indispensable to ensure efficient data storage and exploitation.
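As an illustration of the redundancy reduction described above, the sketch below drops sensor readings that repeat the previously kept value and then compresses what remains before storage. It is only one possible approach under stated assumptions: the helper names (`drop_repeats`, `pack`, `unpack`), the tolerance value, and the sample temperature data are all hypothetical, and the text does not prescribe a particular technique.

```python
import json
import zlib


def drop_repeats(readings, tolerance=0.0):
    """Keep a reading only if it differs from the last kept value by more than `tolerance`."""
    kept = []
    for timestamp, value in readings:
        if not kept or abs(value - kept[-1][1]) > tolerance:
            kept.append((timestamp, value))
    return kept


def pack(readings):
    """Serialize readings as JSON and compress them losslessly with zlib."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))


def unpack(blob):
    """Inverse of pack(); JSON decodes tuples as lists, so convert them back."""
    return [tuple(item) for item in json.loads(zlib.decompress(blob).decode("utf-8"))]


# Hypothetical samples (seconds, degrees Celsius) from one environmental sensor.
raw = [(t, 21.5) for t in range(0, 60)] + [(60, 21.9), (61, 21.9), (62, 22.4)]

reduced = drop_repeats(raw, tolerance=0.1)  # pre-processing: remove redundant samples
blob = pack(reduced)                        # compression before transmission or storage

print(len(raw), "readings reduced to", len(reduced))
```

Filtering repeated values is lossy with respect to the raw stream, whereas the zlib step is lossless, so whether the first step is acceptable depends on what the downstream analysis needs.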
3.2.1 Data Collection
Data collection uses dedicated techniques to acquire raw data
from a specific data generation environment. Four common data collection methods
are described below.