Big Data Analysis - Big Data: Related Technologies, Challenges and Future Prospects

Hashing: a method that transforms data into shorter, fixed-length
numerical values or index values. Hashing offers fast reads, writes,
and queries, but a sound hash function is hard to find.
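
As a minimal sketch of the idea, the snippet below maps strings of any length to a fixed-length value and then to a bucket index. The choice of SHA-256 and the helper names `fixed_length_digest` and `bucket_index` are assumptions made for illustration; the text does not prescribe a particular hash function.

```python
import hashlib


def fixed_length_digest(data: str) -> str:
    """Return a 64-character hex digest, whatever the input length."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def bucket_index(key: str, num_buckets: int) -> int:
    """Reduce the digest to an index in the range [0, num_buckets)."""
    return int(fixed_length_digest(key), 16) % num_buckets


if __name__ == "__main__":
    for key in ("a", "a much longer key " * 100):
        digest = fixed_length_digest(key)
        print(len(key), len(digest), bucket_index(key, 16))
```

A cryptographic hash is slower than the hash functions normally used inside hash tables; it is used here only because it is available in the standard library and gives the same result on every run.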

Index: an index is a widely used method to reduce the cost of disk
reads and writes and to improve insertion, deletion, modification, and
query speeds, both in traditional relational databases that manage
structured data and in technologies that manage semi-structured and
unstructured data. However, an index incurs an additional cost for
storing the index files, and those files must be maintained dynamically
as the data is updated.
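
The trade-off can be illustrated with a small in-memory sketch: a lookup through an index avoids scanning every row, but each insert must also update the index, and the index takes extra space. The class `IndexedTable` and the indexed field `city` are hypothetical names chosen for this example, not part of any particular database.

```python
from collections import defaultdict


class IndexedTable:
    """A list of rows plus a secondary index on the 'city' field."""

    def __init__(self):
        self.rows = []
        # Extra storage: maps each city to the positions of matching rows.
        self.city_index = defaultdict(list)

    def insert(self, row: dict) -> None:
        position = len(self.rows)
        self.rows.append(row)
        # Maintenance cost: the index is updated on every insert.
        self.city_index[row["city"]].append(position)

    def find_by_city_scan(self, city: str) -> list:
        """Full scan: examines every row."""
        return [row for row in self.rows if row["city"] == city]

    def find_by_city_indexed(self, city: str) -> list:
        """Index lookup: touches only the matching rows."""
        return [self.rows[i] for i in self.city_index.get(city, [])]


if __name__ == "__main__":
    table = IndexedTable()
    table.insert({"id": 1, "city": "Wuhan"})
    table.insert({"id": 2, "city": "Seoul"})
    table.insert({"id": 3, "city": "Wuhan"})
    print(table.find_by_city_indexed("Wuhan"))
```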

Trie: also called a trie tree, a variant of the hash tree. It is mainly
applied to rapid retrieval and word frequency statistics. The main idea
of a trie is to exploit the common prefixes of character strings to
reduce string comparisons as far as possible, and so improve query
efficiency.
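
The prefix-sharing idea can be sketched as a small prefix tree that also keeps word counts, covering both uses mentioned above (retrieval and word frequency statistics). The class and method names (`Trie`, `frequency`, `has_prefix`) are assumptions for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child node
        self.count = 0      # number of inserted words ending at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # Words with a common prefix reuse the same path of nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _find(self, text: str):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word: str) -> int:
        node = self._find(word)
        return node.count if node is not None else 0

    def has_prefix(self, prefix: str) -> bool:
        return self._find(prefix) is not None


if __name__ == "__main__":
    trie = Trie()
    for w in ["big", "big", "bit", "data"]:
        trie.insert(w)
    print(trie.frequency("big"), trie.has_prefix("bi"))
```

A lookup costs one step per character of the query, independent of how many words are stored.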

Parallel Computing: in contrast to traditional serial computing,
parallel computing uses several computing resources at once to complete
a computation task. Its basic idea is to decompose a problem into parts
and assign them to several independent processes that complete them
independently, so as to achieve co-processing. Classic parallel
computing models include MPI (Message Passing Interface), MapReduce,
and Dryad. A qualitative comparison of the three models is presented in
Table 5.1.
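
The decompose-and-combine idea can be sketched in a MapReduce-like style on a single machine. This is only an illustration under stated assumptions: it uses Python's standard `concurrent.futures` thread pool as a stand-in for independent processes, it is not the API of MPI, MapReduce, or Dryad, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(items: list, num_chunks: int) -> list:
    """Decompose the input into at most num_chunks contiguous parts."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(lines: list) -> Counter:
    """Map step: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines: list, workers: int = 4) -> Counter:
    chunks = split_into_chunks(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the partial counts into one result.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    text = ["big data", "Big analysis", "data data"]
    print(parallel_word_count(text, workers=2))
```

In CPython, threads do not run CPU-bound code in parallel, so the sketch shows the structure of the computation rather than a speedup; a process pool or a cluster framework would be needed for real parallelism.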

Although parallel computing systems and tools such as MapReduce and
Dryad are useful for big data analysis, they are low-level tools with a
steep learning curve. Therefore, high-level parallel programming tools
and languages are being developed on top of these systems. Such
high-level languages include Sawzall, Pig, and Hive for MapReduce, and
Scope and DryadLINQ for Dryad.

5.3 Architecture for Big Data Analysis

Because big data comes from a wide range of sources, varies in
structure, and serves broad application fields, different analytical
architectures should be considered for big data with different
application requirements.

5.3.1 Real-Time vs. Offline Analysis

Big data analysis can be classified into real-time analysis and offline
analysis according to its timeliness requirements. Real-time analysis
is mainly used in e-commerce and finance. Since the data changes
constantly, rapid analysis is needed and analytical results must be
returned with a very short delay. The main existing
