would be variations of hypotheses that we would need to come up with as an initial
step.
The hypotheses we form need basic validation, and for this we need to do a
preliminary data exploration. We will cover data exploration and processing
at length in later chapters.
Phase 2 - set up data
This phase is one of the crucial initial steps: we analyze the various sources
of data, decide on a strategy to aggregate or integrate them, and scope the kind
of data required.
As part of this step, we identify the kind of data we need to solve the problem
in context. We need to consider the lifespan, volume, and type of the data.
Usually we need access to the raw base data rather than processed or aggregated
data. An important aspect of this phase is confirming that the required data is
actually available. A detailed analysis is needed to identify how much historical
data must be extracted to run the tests against the initial hypotheses defined
earlier. We need to consider all the characteristics of Big Data, such as volume,
varied data formats, data quality, and data influx speed. At the end of this
phase, the final data scope is formed by seeking the required validations from
domain experts.
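To make the scoping questions above concrete, here is a minimal sketch of profiling a candidate raw data source for volume, historical span, and completeness. The function name profile_source, the field names, and the sample records are all hypothetical and chosen only for illustration; a real project would run a comparable check against its own sources.

```python
from datetime import date


def profile_source(records, date_field, required_fields):
    """Summarize volume, time span, and completeness of raw records."""
    # Collect the dates that are actually present to measure the historical span.
    dates = [r[date_field] for r in records if r.get(date_field) is not None]
    # Count missing or empty values per required field as a basic quality signal.
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    return {
        "row_count": len(records),
        "earliest": min(dates) if dates else None,
        "latest": max(dates) if dates else None,
        "span_days": (max(dates) - min(dates)).days if dates else 0,
        "missing_counts": missing,
    }


# Hypothetical raw records standing in for an extracted base data set.
sample = [
    {"order_date": date(2021, 1, 5), "amount": 120.0, "region": "north"},
    {"order_date": date(2021, 6, 30), "amount": None, "region": "south"},
    {"order_date": date(2022, 1, 5), "amount": 75.5, "region": ""},
]

profile = profile_source(sample, "order_date", ["amount", "region"])
print(profile)
```

A summary like this could be shared with domain experts when validating the final data scope, for example to judge whether the available history is long enough to test the initial hypotheses.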
Phase 3 - explore/transform data
The previous two phases define the scope of the analytics project, covering both
business and data requirements. Now it's time for data exploration and
transformation, also referred to as data preparation. Of all the phases, this one
is the most iterative and time-consuming.
During data exploration, it is important not to interfere with ongoing
organizational processes.
We start by gathering all the kinds of data identified in phase 2 to solve the
problem defined in phase 1. This data can be structured, semi-structured, or
unstructured, and is usually held in its raw format, as this allows us to try
various modeling techniques and derive an optimal one.
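As an illustration of gathering mixed kinds of data while keeping them in raw form, the sketch below stages structured (CSV), semi-structured (JSON lines), and unstructured (free text) inputs side by side without transforming their contents. The loader functions, source names, and sample inputs are assumptions made for this example, not part of any particular tool.

```python
import csv
import io
import json


def load_csv(text, source):
    """Stage structured rows; values stay as the raw strings from the file."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"source": source, "format": "structured", "raw": dict(row)}
            for row in reader]


def load_json_lines(text, source):
    """Stage semi-structured records, one JSON document per non-empty line."""
    return [{"source": source, "format": "semi-structured", "raw": json.loads(line)}
            for line in text.splitlines() if line.strip()]


def load_text(text, source):
    """Stage unstructured text as a single untouched record."""
    return [{"source": source, "format": "unstructured", "raw": text}]


# Hypothetical inputs standing in for data gathered from different systems.
csv_text = "id,amount\n1,120.0\n2,75.5\n"
jsonl_text = '{"id": 3, "tags": ["web"]}\n{"id": 4}\n'
note_text = "Customer called about a delayed order."

staged = (
    load_csv(csv_text, "orders.csv")
    + load_json_lines(jsonl_text, "events.jsonl")
    + load_text(note_text, "support_note.txt")
)

for record in staged:
    print(record["format"], record["raw"])
```

Keeping the raw payload next to a small amount of metadata leaves later steps free to apply different transformations when trying out modeling techniques.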