If users are to compose and build complex analytical solutions over big data, it is
essential that they have appropriate high-level primitives to specify their needs in such
flexible systems. The map-reduce framework has been tremendously valuable, but is only
a first step. Even declarative languages that exploit it, such as Pig Latin, are at a rather
low level when it comes to complex analysis tasks. At present, big data analytics solutions
employ a host of tools and processes to develop an end-to-end, production-ready system.
Each operation within the system (cleaning, extraction, modeling, etc.) potentially runs
on a very large data set. Furthermore, each operation itself is sufficiently complex that
there are many choices and optimizations possible in how it is implemented.
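To make the "first step" concrete, the following is a minimal in-memory sketch of the map-reduce primitive itself (the helper names `map_phase` and `reduce_phase` are illustrative; a real framework distributes these steps across a cluster and handles shuffling, fault tolerance, and I/O):

```python
from itertools import groupby

# Hypothetical in-memory sketch of the map-reduce primitive.
def map_phase(records, mapper):
    # Apply the mapper to every record, emitting (key, value) pairs.
    return [pair for rec in records for pair in mapper(rec)]

def reduce_phase(pairs, reducer):
    # Group pairs by key (groupby requires sorted input), then reduce each group.
    pairs.sort(key=lambda kv: kv[0])
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=lambda kv: kv[0])}

# Word count, the canonical example of this low-level primitive.
lines = ["big data analytics", "big data pipelines"]
counts = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda word, ones: sum(ones))
# counts == {"analytics": 1, "big": 2, "data": 2, "pipelines": 1}
```

Even this tiny example shows why the primitive is low level: every analysis task must be decomposed by hand into mappers and reducers, which is exactly the gap that higher-level languages such as Pig Latin begin to close.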
The very fact that big data analysis typically involves multiple phases highlights
a challenge that arises routinely in practice: production systems must run complex
analytic pipelines, or workflows, at routine intervals, e.g., hourly or daily. New data must
be incrementally accounted for, taking into account the results of prior analysis and
preexisting data. And of course, provenance must be preserved, and must include the
phases in the analytic pipeline. Current systems offer little to no support for such big data
pipelines, and this is in itself a challenging objective.
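The incremental-with-provenance requirement can be sketched as follows. This is an illustrative outline only, not a production design: the phase names, the provenance record layout, and the folding of prior results into the new run are all assumptions made for the example.

```python
import hashlib
import json
import time

# Hypothetical sketch: run one pipeline phase and record its provenance.
def run_phase(name, func, inputs, provenance):
    output = func(inputs)
    provenance.append({
        "phase": name,
        # Digest of the inputs lets a later audit tie results to source data.
        "input_digest": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })
    return output

def clean(batch):
    # Drop malformed records (here: missing values).
    return [r for r in batch if r.get("value") is not None]

def model(batch, prior_total=0):
    # Fold the new batch into the result of the prior run.
    return prior_total + sum(r["value"] for r in batch)

provenance = []
prior_total = 10                      # result carried over from yesterday's run
new_batch = [{"value": 3}, {"value": None}, {"value": 4}]
cleaned = run_phase("clean", clean, new_batch, provenance)
total = model(cleaned, prior_total)   # prior results plus today's new data
```

The point of the sketch is the shape of the problem: each scheduled run must combine preexisting results with the new increment, and each phase must leave an auditable provenance trail.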
The sections below outline a methodology for developing big data analytics
solutions.
Big Data Analytics Methodology
The big data analytics methodology combines sequential execution of tasks in
some phases with highly iterative execution in others. Because of the scale issues
associated with big data systems, designers must take a pragmatic approach,
modifying and expanding their processes gradually across several activities rather
than designing the system once and for all with the end state in mind.
Figure 7-1 provides a high-level view of the big data analytics methodology, and
big data analytics designers (i.e., architects, statisticians, analysts, etc.) are advised to
iterate through the steps outlined in Figure 7-1. The designer should plan to complete
several cycles of design and experimentation during steps 2 through 5. Each cycle
should incorporate additional, larger data samples and apply different analytics
techniques as appropriate for the data and relevant to the business problem.
Designers should revisit the entire framework (steps 1 through 7; see Figure 7-1)
periodically after the system starts running in production.
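The cycling through steps 2 through 5 can be sketched as a simple loop. Everything here is an assumption made for illustration: the doubling sample size, the rotation through techniques, and the two toy techniques themselves merely stand in for real sampling and modeling choices.

```python
# Illustrative sketch of iterative design cycles: each cycle draws a
# larger data sample and applies a (possibly different) technique.
def run_cycles(data, techniques, n_cycles=3):
    results = []
    sample_size = max(1, len(data) // (2 ** (n_cycles - 1)))
    for cycle in range(n_cycles):
        sample = data[:sample_size]            # grow the sample each cycle
        technique = techniques[cycle % len(techniques)]
        results.append((len(sample), technique.__name__, technique(sample)))
        sample_size = min(len(data), sample_size * 2)
    return results

def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    return max(xs) - min(xs)

data = list(range(1, 9))                       # stand-in for a production data set
history = run_cycles(data, [mean, spread])
# history records (sample size, technique, result) for each cycle
```

Keeping a record of each cycle's sample size, technique, and result is what lets the designer compare cycles and decide when the design is ready for steps 6 and 7.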