If users are to compose and build complex analytical solutions over big data, it is
essential that they have appropriate high-level primitives to specify their needs in such
flexible systems. The map-reduce framework has been tremendously valuable, but is only
a first step. Even declarative languages that exploit it, such as Pig Latin, are at a rather
low level when it comes to complex analysis tasks. At present, big data analytics solutions
employ a host of tools and processes to develop an end-to-end, production-ready system.
Each operation within the system (cleaning, extraction, modeling, etc.) potentially runs
on a very large data set. Furthermore, each operation itself is sufficiently complex that
there are many choices and optimizations possible in how it is implemented.
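To make the "first step" concrete, the following is a minimal in-memory sketch of the map-reduce primitive itself (the helper names `map_phase` and `reduce_phase` are illustrative; a real framework distributes these steps across a cluster and handles shuffling, fault tolerance, and I/O):

```python
from itertools import groupby

# Hypothetical in-memory sketch of the map-reduce primitive.
def map_phase(records, mapper):
    # Apply the mapper to every record, emitting (key, value) pairs.
    return [pair for rec in records for pair in mapper(rec)]

def reduce_phase(pairs, reducer):
    # Group pairs by key (groupby requires sorted input), then reduce each group.
    pairs.sort(key=lambda kv: kv[0])
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=lambda kv: kv[0])}

# Word count, the canonical example of this low-level primitive.
lines = ["big data analytics", "big data pipelines"]
counts = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda word, ones: sum(ones))
# counts == {"analytics": 1, "big": 2, "data": 2, "pipelines": 1}
```

Even this tiny example shows why the primitive is low level: every analysis task must be decomposed by hand into mappers and reducers, which is exactly the gap that higher-level languages such as Pig Latin begin to close.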
The very fact that big data analysis typically involves multiple phases highlights
a challenge that arises routinely in practice: production systems must run complex
analytic pipelines, or workflows, at routine intervals, e.g., hourly or daily. New data must
be incrementally accounted for, taking into account the results of prior analysis and
preexisting data. And of course, provenance must be preserved, and must include the
phases in the analytic pipeline. Current systems offer little to no support for such big data
pipelines, and this is in itself a challenging objective.
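The incremental-with-provenance requirement can be sketched as follows. This is an illustrative outline only, not a production design: the phase names, the provenance record layout, and the folding of prior results into the new run are all assumptions made for the example.

```python
import hashlib
import json
import time

# Hypothetical sketch: run one pipeline phase and record its provenance.
def run_phase(name, func, inputs, provenance):
    output = func(inputs)
    provenance.append({
        "phase": name,
        # Digest of the inputs lets a later audit tie results to source data.
        "input_digest": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })
    return output

def clean(batch):
    # Drop malformed records (here: missing values).
    return [r for r in batch if r.get("value") is not None]

def model(batch, prior_total=0):
    # Fold the new batch into the result of the prior run.
    return prior_total + sum(r["value"] for r in batch)

provenance = []
prior_total = 10                      # result carried over from yesterday's run
new_batch = [{"value": 3}, {"value": None}, {"value": 4}]
cleaned = run_phase("clean", clean, new_batch, provenance)
total = model(cleaned, prior_total)   # prior results plus today's new data
```

The point of the sketch is the shape of the problem: each scheduled run must combine preexisting results with the new increment, and each phase must leave an auditable provenance trail.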
The sections below outline a methodology for developing big data analytics
solutions.
Big Data Analytics Methodology
The big data analytics methodology combines sequential execution of tasks in
some phases with highly iterative execution in others. Because of the scale issues
associated with big data systems, designers must take a pragmatic approach,
modifying and expanding their processes gradually across several activities rather
than designing the system once and for all with the end state in mind.
Figure 7-1 provides a high-level view of the big data analytics methodology, and
big data analytics designers (i.e., architects, statisticians, analysts, etc.) are advised to
iterate through the steps outlined in Figure 7-1. The designer should plan to complete
several cycles of design and experimentation during steps 2 through 5. Each cycle
should incorporate additional, larger data samples and apply different analytics
techniques as appropriate for the data and relevant to the business problem.
Designers should revisit the entire framework (steps 1 through 7; see Figure 7-1)
periodically after the system starts running in production.
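The cycling through steps 2 through 5 can be sketched as a simple loop. Everything here is an assumption made for illustration: the doubling sample size, the rotation through techniques, and the two toy techniques themselves merely stand in for real sampling and modeling choices.

```python
# Illustrative sketch of iterative design cycles: each cycle draws a
# larger data sample and applies a (possibly different) technique.
def run_cycles(data, techniques, n_cycles=3):
    results = []
    sample_size = max(1, len(data) // (2 ** (n_cycles - 1)))
    for cycle in range(n_cycles):
        sample = data[:sample_size]            # grow the sample each cycle
        technique = techniques[cycle % len(techniques)]
        results.append((len(sample), technique.__name__, technique(sample)))
        sample_size = min(len(data), sample_size * 2)
    return results

def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    return max(xs) - min(xs)

data = list(range(1, 9))                       # stand-in for a production data set
history = run_cycles(data, [mean, spread])
# history records (sample size, technique, result) for each cycle
```

Keeping a record of each cycle's sample size, technique, and result is what lets the designer compare cycles and decide when the design is ready for steps 6 and 7.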