Big Data Analytics Methodology - Big Data Imperatives

Even after data cleaning and error correction, some incompleteness and errors in

data are likely to remain. This incompleteness and these errors must be managed during

data analysis. Doing this correctly is a challenge.

Scale: Managing large and rapidly increasing volumes of data has been a challenging

issue for many decades. With the advent of technologies like Hadoop distributions

and cloud computing, we have the ability to store massive amounts of data at relatively

low cost. These innovative platforms now aggregate multiple disparate workloads with

varying performance goals (e.g., interactive services demand that the data processing

engine return an answer within a fixed response time cap) into very large clusters.

This level of sharing of resources on expensive and large clusters requires new ways of

determining how to run and execute data processing jobs so that we can meet the goals

of each workload cost-effectively, and to deal with system failures, which occur more

frequently as we operate on larger and larger clusters (that are required to deal with

the rapid growth in data volumes). This places a premium on declarative approaches

to expressing programs, even those doing complex machine learning tasks,

since global optimization across multiple users' programs is necessary for good overall

performance. Reliance on user-driven program optimizations is likely to lead to poor

cluster utilization, since users are unaware of other users' programs. System-driven

holistic optimization requires programs to be sufficiently transparent, e.g., as in relational

database systems, where declarative query languages are designed with this in mind.

Timeliness: The flip side of size is speed. The larger the data set to be processed, the

longer it will take to analyze. The design of a system that effectively deals with size is likely

also to result in a system that can process a given size of data set faster. However, it is not

just this speed that is usually meant when one speaks of velocity in the context of big data.

There are many situations in which the result of the analysis is required immediately.

For example, if a fraudulent credit card transaction is suspected, it should ideally be

flagged before the transaction is completed, potentially preventing the transaction from

taking place at all. Obviously, a full analysis of a user's purchase history is not likely to

be feasible in real time. Rather, we need to develop partial results in advance so that a

small amount of incremental computation with new data can be used to arrive at a quick

determination.
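
The idea of keeping partial results so that each new transaction needs only a small incremental computation can be sketched as follows. This is a minimal illustration, not a real fraud detection method: the SpendingProfile class, the threshold of 3 standard deviations, and the minimum history of 5 transactions are all assumptions made for the example. The running mean and variance are maintained with Welford's online update, so scoring a new amount never requires rescanning the purchase history.

```python
import math
from dataclasses import dataclass


@dataclass
class SpendingProfile:
    """Running summary of one user's past transaction amounts (hypothetical)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, amount: float) -> None:
        # Welford's online update: constant work per new transaction,
        # no need to revisit earlier transactions.
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    def stddev(self) -> float:
        # Sample standard deviation from the running summary.
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))

    def is_suspicious(self, amount: float, threshold: float = 3.0) -> bool:
        # Assumed rule for illustration: flag amounts far above usual spending.
        sd = self.stddev()
        if self.count < 5 or sd == 0.0:
            return False  # too little history to judge
        return (amount - self.mean) / sd > threshold


profile = SpendingProfile()
for past_amount in [20.0, 25.0, 22.0, 18.0, 24.0, 21.0]:
    profile.update(past_amount)

# Scoring uses only the precomputed summary, so it is cheap at transaction time.
print(profile.is_suspicious(23.0))
print(profile.is_suspicious(500.0))
```

The summary held per user is three numbers, so the check at transaction time costs the same regardless of how long the purchase history is. A production system would likely combine many such precomputed features rather than rely on a single amount-based rule.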

Given a large data set, it is often necessary to find elements in it that meet a specified

criterion. In the course of data analysis, this sort of search is likely to occur repeatedly.

Scanning the entire data set to find suitable elements is obviously impractical. Rather,

index structures are created in advance to permit finding qualifying elements quickly.

The problem is that each index structure is designed to support only some classes

of criteria. New analyses of big data introduce new types of criteria, and with them

a need to devise new index structures to support those criteria. For

example, consider a traffic management system with information regarding thousands

of vehicles and local hot spots on roadways. The system may need to predict potential

congestion points along a route chosen by a user and then suggest alternatives. Doing

so requires evaluating multiple spatial proximity queries working with the trajectories

of moving objects. New index structures are required to support such queries. Designing

such structures becomes particularly challenging when the data volume is growing

rapidly and the queries have tight response time limits.
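
To make the role of an index concrete, here is a minimal sketch of a uniform grid index answering "which vehicles are within a given radius of this point" without scanning every vehicle. It is a deliberately simple, well-known structure, not one of the new index structures the text calls for: it indexes only current positions, not trajectories, and the GridIndex class, its method names, and the cell size are assumptions made for the example.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid over the plane; each cell holds the ids of vehicles in it."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # (cell_x, cell_y) -> set of vehicle ids
        self.positions = {}            # vehicle id -> (x, y)

    def _cell(self, x: float, y: float):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def update(self, vehicle_id, x: float, y: float) -> None:
        # Moving objects: remove the vehicle from its old cell before re-inserting.
        old = self.positions.get(vehicle_id)
        if old is not None:
            self.cells[self._cell(*old)].discard(vehicle_id)
        self.positions[vehicle_id] = (x, y)
        self.cells[self._cell(x, y)].add(vehicle_id)

    def within(self, x: float, y: float, radius: float):
        # Visit only the cells overlapping the query's bounding box,
        # then filter candidates by exact distance.
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        found = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for vehicle_id in self.cells.get((cx, cy), ()):
                    px, py = self.positions[vehicle_id]
                    if math.hypot(px - x, py - y) <= radius:
                        found.append(vehicle_id)
        return sorted(found)


index = GridIndex(cell_size=10.0)
index.update("a", 1.0, 1.0)
index.update("b", 5.0, 5.0)
index.update("c", 50.0, 50.0)
print(index.within(0.0, 0.0, 10.0))
```

The sketch also shows the limitation described above: the grid supports proximity queries on points well, but a criterion such as "vehicles whose planned routes will cross this hot spot in the next ten minutes" would need a different structure that indexes trajectories over time.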

Privacy: The privacy of data is another huge concern, and one that increases in

the context of big data. There are numerous debates regarding the inappropriate use

of personal data, particularly through linking of data from multiple sources. Managing
