Even after data cleaning and error correction, some incompleteness and errors in
data are likely to remain. This incompleteness and these errors must be managed during
data analysis. Doing this correctly is a challenge.
Scale: Managing large and rapidly increasing volumes of data has been a challenging
issue for many decades. With the advent of technologies like Hadoop distributions
and cloud computing, we have the ability to store massive amounts of data at relatively
low cost. These platforms now aggregate multiple disparate workloads with varying
performance goals (e.g., interactive services demand that the data processing engine
return an answer within a fixed response time cap) into very large clusters. Sharing
resources at this level on large, expensive clusters requires new ways of deciding how to
schedule and execute data processing jobs so that the goals of each workload are met
cost-effectively. It also requires new ways of dealing with system failures, which occur
more frequently as clusters grow to keep pace with the rapid growth in data volumes. This
places a premium on declarative approaches to expressing programs, even programs that
perform complex machine learning tasks, since global optimization across multiple users'
programs is necessary for good overall performance. Reliance on user-driven program
optimizations is likely to lead to poor
cluster utilization, since users are unaware of other users' programs. System-driven
holistic optimization requires programs to be sufficiently transparent, e.g., as in relational
database systems, where declarative query languages are designed with this in mind.
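The point about transparency can be illustrated with a small sketch. It uses Python's
standard sqlite3 module as a stand-in for a relational system. The table name purchases,
the index name idx_user, and the helper plan_for are assumptions made for this
illustration only. The query text never changes, yet the system is free to pick a
different execution plan once an index exists.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the optimizer's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# The program states WHAT is wanted, not HOW to compute it.
query = "SELECT SUM(amount) FROM purchases WHERE user_id = ?"

plan_before = plan_for(conn, query, (7,))
total_before = conn.execute(query, (7,)).fetchone()[0]

# A system-side change: the query text is untouched.
conn.execute("CREATE INDEX idx_user ON purchases (user_id)")

plan_after = plan_for(conn, query, (7,))
total_after = conn.execute(query, (7,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
```

The answer is identical in both runs; only the plan differs. A hand-written scan loop
would hide its intent from the system, which would then have no such freedom to
reoptimize it.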
Timeliness: The flip side of size is speed. The larger the data set to be processed, the
longer it will take to analyze. The design of a system that effectively deals with size is likely
also to result in a system that can process a given size of data set faster. However, it is not
just this speed that is usually meant when one speaks of velocity in the context of big data.
There are many situations in which the result of the analysis is required immediately.
For example, if a fraudulent credit card transaction is suspected, it should ideally be
flagged before the transaction is completed, potentially preventing the transaction from
taking place at all. Obviously, a full analysis of a user's purchase history is not likely to
be feasible in real time. Rather, we need to develop partial results in advance so that a
small amount of incremental computation with new data can be used to arrive at a quick
determination.
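One way to picture "partial results in advance" is a per-user running summary that is
updated as each purchase arrives, so that checking a new transaction costs constant time.
The sketch below is an assumption-laden illustration, not a real fraud model: it uses
Welford's online algorithm for mean and variance, and the class names RunningStats and
FraudScreen, the z-score rule, and the thresholds are all hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Per-user summary maintained incrementally (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, amount: float) -> None:
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    def stddev(self) -> float:
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


class FraudScreen:
    """Hypothetical screen: flags amounts far from a user's usual spending."""

    def __init__(self, z_threshold: float = 3.0, min_history: int = 5):
        self.z_threshold = z_threshold
        self.min_history = min_history
        self.stats = defaultdict(RunningStats)

    def record(self, user_id: str, amount: float) -> None:
        """Fold a completed purchase into the precomputed summary."""
        self.stats[user_id].update(amount)

    def is_suspicious(self, user_id: str, amount: float) -> bool:
        """Constant-time check; never rescans the purchase history."""
        s = self.stats[user_id]
        if s.count < self.min_history:
            return False  # too little history to judge
        sd = s.stddev()
        if sd == 0.0:
            return amount != s.mean
        return abs(amount - s.mean) / sd > self.z_threshold


screen = FraudScreen()
for amount in [20.0, 22.0, 19.0, 21.0, 18.0, 20.0]:
    screen.record("u1", amount)

print(screen.is_suspicious("u1", 21.0))
print(screen.is_suspicious("u1", 500.0))
```

The full history is touched only once, when each purchase is recorded. The decision made
while the transaction is in flight reads three numbers per user.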
Given a large data set, it is often necessary to find elements in it that meet a specified
criterion. In the course of data analysis, this sort of search is likely to occur repeatedly.
Scanning the entire data set to find suitable elements is obviously impractical. Rather,
index structures are created in advance to permit finding qualifying elements quickly.
The problem is that each index structure is designed to support only some classes
of criteria. New analyses of big data bring new types of criteria, and with them a need
to devise new index structures that support those criteria. For
example, consider a traffic management system with information regarding thousands
of vehicles and local hot spots on roadways. The system may need to predict potential
congestion points along a route chosen by a user and then suggest alternatives. Doing
so requires evaluating multiple spatial proximity queries working with the trajectories
of moving objects. New index structures are required to support such queries. Designing
such structures becomes particularly challenging when the data volume is growing
rapidly and the queries have tight response time limits.
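As a rough illustration of why a purpose-built index helps with spatial proximity
queries, the sketch below uses a uniform grid over current vehicle positions. This is a
deliberately simple stand-in and not one of the trajectory indexes the text calls for:
it stores only the latest position of each object, and the class name GridIndex, its
methods, and the cell size are assumptions chosen for the example.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid over 2-D points; supports moving objects and radius queries."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(dict)  # (cx, cy) -> {obj_id: (x, y)}
        self.positions = {}             # obj_id -> (x, y)

    def _cell(self, x: float, y: float):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def upsert(self, obj_id, x: float, y: float) -> None:
        """Insert an object or move it to a new position."""
        old = self.positions.get(obj_id)
        if old is not None:
            old_cell = self._cell(*old)
            del self.cells[old_cell][obj_id]
            if not self.cells[old_cell]:
                del self.cells[old_cell]
        self.positions[obj_id] = (x, y)
        self.cells[self._cell(x, y)][obj_id] = (x, y)

    def within(self, x: float, y: float, radius: float):
        """Return sorted ids of objects within `radius` of (x, y).

        Only cells overlapping the query's bounding box are inspected,
        so distant objects are never touched.
        """
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for obj_id, (px, py) in self.cells.get((cx, cy), {}).items():
                    if math.hypot(px - x, py - y) <= radius:
                        hits.append(obj_id)
        return sorted(hits)


index = GridIndex(cell_size=1.0)
index.upsert("a", 0.5, 0.5)
index.upsert("b", 2.5, 0.5)
index.upsert("c", 9.0, 9.0)

near_start = index.within(1.0, 0.5, 1.6)
index.upsert("b", 9.0, 9.5)  # vehicle "b" moves across the map
near_start_after_move = index.within(1.0, 0.5, 1.6)
near_far_corner = index.within(9.0, 9.0, 1.0)
```

A grid like this answers "what is near this point now" but says nothing about where a
vehicle will be along its route, which is the gap that trajectory-aware index structures
are meant to fill.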
Privacy: The privacy of data is another huge concern, and one that increases in
the context of big data. There are numerous debates regarding the inappropriate use
of personal data, particularly through linking of data from multiple sources. Managing
 