• Hashing: a method that transforms data into shorter, fixed-length numerical
values or index values. Hashing offers fast reads, writes, and queries, but a
sound hash function is hard to find.
• Index: an index is an effective way to reduce the cost of disk reads and
writes and to speed up insertion, deletion, modification, and queries, both in
traditional relational databases that manage structured data and in
technologies that manage semi-structured and unstructured data. Its
disadvantage is the additional cost of storing the index files, which must also
be maintained dynamically as the data is updated.
• Trie: also called a trie tree or prefix tree, described here as a variant of
the hash tree. It is mainly applied to rapid retrieval and word frequency
statistics. The main idea of a trie is to exploit the common prefixes of
character strings so that as few character comparisons as possible are needed,
which improves query efficiency.
• Parallel Computing: in contrast to traditional serial computing, parallel
computing uses several computing resources to complete a single computation
task. Its basic idea is to decompose a problem into parts and assign them to
several separate processes that complete them independently, so that the
parts are processed cooperatively. Classic parallel computing models include MPI
(Message Passing Interface), MapReduce, and Dryad. A qualitative comparison
of the three models is presented in Table 5.1.
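The first three methods in the list above can be illustrated with a minimal Python sketch. It is not taken from the source: the function `build_hash_index`, the class `Trie`, and the sample records and words are hypothetical names and data chosen for illustration. The index relies on Python's built-in `dict`, which is a hash table, and the trie stores a count at each word end so that it supports both prefix retrieval and word frequency statistics.

```python
from collections import defaultdict


def build_hash_index(records, key):
    """Map each value of `key` to the positions of the records that hold it."""
    index = defaultdict(list)  # dict lookups hash the key internally
    for position, record in enumerate(records):
        index[record[key]].append(position)
    return index


class Trie:
    """Prefix tree that stores words together with their occurrence counts."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted words that end at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, text):
        """Follow `text` from the root; return the final node or None."""
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        node = self._find(word)
        return node.count if node else 0

    def with_prefix(self, prefix):
        """Return all stored words that start with `prefix`, sorted."""
        node = self._find(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


# Hypothetical sample data.
records = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Oslo"},
]
city_index = build_hash_index(records, "city")

trie = Trie()
for w in ["data", "database", "date", "data"]:
    trie.insert(w)
```

Looking up `city_index["Oslo"]` avoids scanning every record, at the price of keeping the index in memory and rebuilding or updating it whenever `records` changes, which is the maintenance cost noted above. In the trie, the words sharing the prefix `dat` share the same three nodes, so a prefix query compares those characters only once.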
Although parallel computing systems and tools such as MapReduce and Dryad are
useful for big data analysis, they are low-level tools with a steep learning
curve. Therefore, high-level parallel programming tools and languages are being
developed on top of these systems. Such high-level languages include Sawzall,
Pig, and Hive for MapReduce, and Scope and DryadLINQ for Dryad.
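To make the decompose-and-assign idea concrete, the following is a minimal sketch of a MapReduce-style word count in plain Python. It is an illustration under stated assumptions, not the API of any MapReduce implementation: the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a local thread pool merely stands in for the independent workers of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, one in pairs:
            groups[word].append(one)
    return groups


def reduce_phase(groups):
    """Sum the grouped values to get one count per word."""
    return {word: sum(values) for word, values in groups.items()}


def word_count(chunks, workers=2):
    # Each chunk is mapped independently; the thread pool is only a
    # stand-in for separate worker processes or machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input, already split into chunks.
chunks = ["big data analysis", "parallel data processing", "big data tools"]
counts = word_count(chunks)
```

Even this small example requires the programmer to write the map, shuffle, and reduce steps by hand, which suggests why higher-level languages layered on these systems are attractive.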
5.3 Architecture for Big Data Analysis
Because big data comes from a wide range of sources, takes different
structures, and serves broad application fields, different analytical
architectures should be considered for different application requirements.
5.3.1 Real-Time vs. Offline Analysis
Big data analysis can be classified into real-time analysis and offline analysis
according to its timeliness requirements. Real-time analysis is mainly used in
e-commerce and finance. Because the data changes constantly, it must be analyzed
rapidly and the results returned with a very short delay. The main existing