proposed a framework called BIO-AJAX to standardize biological data so that
further computation can be conducted and search quality improved. With
BIO-AJAX, some errors and duplications can be eliminated, and common data
mining techniques can be executed more effectively.
3.2.3.3 Redundancy Elimination
Data redundancy refers to repeated or surplus data, which occurs in many
datasets. Data redundancy increases unnecessary data transmission costs and
causes defects in storage systems, e.g., wasted storage space, data
inconsistency, reduced data reliability, and data corruption. Therefore, various
redundancy reduction methods have been proposed, such as redundancy detection,
data filtering, and data compression. Such methods may apply to different datasets
or application environments. However, redundancy reduction may also bring about
certain negative effects; for example, data compression and decompression impose
additional computational burden. Therefore, the benefits of redundancy reduction
should be carefully weighed against its cost.
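As a rough illustration of this trade-off, the following sketch (Python, using the standard zlib module; the repetitive payload and the compression level are purely illustrative) measures how much space compression saves against the extra time spent compressing and decompressing:

```python
import time
import zlib

# Hypothetical, highly repetitive dataset standing in for redundant records.
payload = b"sensor_reading=23.5;" * 50_000

t0 = time.perf_counter()
compressed = zlib.compress(payload, 6)       # space saving ...
t1 = time.perf_counter()
restored = zlib.decompress(compressed)       # ... paid for with extra CPU time
t2 = time.perf_counter()
assert restored == payload

print(f"original:   {len(payload):>9} bytes")
print(f"compressed: {len(compressed):>9} bytes "
      f"({len(compressed) / len(payload):.1%} of original)")
print(f"compress: {(t1 - t0) * 1e3:.1f} ms, decompress: {(t2 - t1) * 1e3:.1f} ms")
```

Whether the saved transmission and storage cost outweighs the added computation depends on the data's redundancy and on how often it must be read back.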
Data collected from different fields will increasingly appear in image or video
formats. It is well known that images and videos contain considerable redundancy,
including temporal redundancy, spatial redundancy, statistical redundancy, and
sensing redundancy. Video compression is widely used to reduce redundancy in
video data, as specified in many video coding standards (MPEG-2, MPEG-4,
H.263, and H.264/AVC). In [47], the authors investigated the problem of video
compression in a video surveillance system with a video sensor network. They
proposed a new MPEG-4 based method that exploits the contextual redundancy
between background and foreground in a scene. The low complexity and the low
compression ratio of the proposed approach were demonstrated by the evaluation
results.
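The following is only a minimal sketch of the general idea, not the scheme of [47]: it uses OpenCV's MOG2 background subtractor (the input file name is hypothetical) to separate a largely static background from the moving foreground, the intuition being that the static part can be modeled once while only foreground changes need fine-grained encoding:

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")   # hypothetical surveillance clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)        # non-zero where the scene changed
    # The background model captures the contextual redundancy; only the
    # foreground pixels would need per-frame, fine-grained encoding.
    foreground = cv2.bitwise_and(frame, frame, mask=fg_mask)
    changed = cv2.countNonZero(fg_mask)
    print(f"foreground pixels this frame: {changed}")

cap.release()
```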
In generalized data transmission or storage, repeated data deletion (data
deduplication) is a specialized data compression technique that aims to eliminate
duplicate copies of data [48]. With data deduplication, each individual data block
or data segment is assigned an identifier (e.g., computed with a hash algorithm)
and stored, with the identifier added to an identification list. As deduplication
proceeds, if a new data block has an identifier that is already present in the
identification list, the new block is deemed redundant and is replaced by a
reference to the corresponding stored data block. Data deduplication can greatly
reduce storage requirements, which is particularly important for a big data
storage system.
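A minimal sketch of this idea in Python, assuming fixed-size blocks and SHA-256 digests as identifiers (production deduplication systems typically use variable-size chunking and more elaborate index structures):

```python
import hashlib

def deduplicate(stream: bytes, block_size: int = 4096):
    """Split a byte stream into fixed-size blocks; store each unique block
    only once, keyed by its SHA-256 digest (the 'identification list')."""
    store = {}      # digest -> block payload, stored exactly once
    layout = []     # ordered digests describing how to rebuild the stream
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # new block: keep the payload
            store[digest] = block
        layout.append(digest)        # duplicate block: keep only a reference
    return store, layout

def restore(store, layout):
    return b"".join(store[d] for d in layout)

data = b"ABCD" * 4096                # highly repetitive example input
store, layout = deduplicate(data)
assert restore(store, layout) == data
print(len(store), "unique blocks for", len(layout), "logical blocks")
```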
Apart from the aforementioned data pre-processing methods, specific data objects
must undergo additional operations such as feature extraction, which plays an
important role in multimedia search and DNA analysis [49-51]. Usually,
high-dimensional feature vectors (or high-dimensional feature points) are used to
describe such data objects, and the system stores these feature vectors for
future retrieval. Data transfer is usually used to process distributed
heterogeneous data sources, especially business datasets [52].
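To make the feature-vector idea concrete, the following sketch (Python/NumPy; the 128-dimensional random vectors and the retrieve helper are purely hypothetical) stores normalized high-dimensional feature vectors and retrieves the stored objects nearest to a query by cosine similarity:

```python
import numpy as np

# Hypothetical 128-dimensional feature vectors describing 1000 data objects
# (e.g., image descriptors or k-mer frequency profiles), stored for retrieval.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def retrieve(query, index, k=5):
    """Return indices of the k stored vectors most similar to the query
    (cosine similarity, i.e., dot product on normalized vectors)."""
    query = query / np.linalg.norm(query)
    scores = index @ query
    return np.argsort(scores)[::-1][:k]

query_vec = rng.normal(size=128)
print("top matches:", retrieve(query_vec, features))
```

Real systems replace the brute-force scan with an approximate nearest-neighbor index, but the stored representation is the same: one high-dimensional feature vector per object.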