2.3 Phase 2: Data Preparation
The second phase of the Data Analytics Lifecycle involves data preparation, which
includes the steps to explore, preprocess, and condition data prior to modeling and
analysis. In this phase, the team needs to create a robust environment, separate
from the production environment, in which it can explore the data. Usually, this
is done by preparing an analytics sandbox. To get the data into the sandbox, the
team needs to perform ETLT, which combines extract-transform-load (ETL) and
extract-load-transform (ELT) steps. Once the data is in the sandbox, the team needs to
learn about the data and become familiar with it. Understanding the data in detail is
critical to the success of the project. The team also must decide how to condition and
transform the data into a format that facilitates subsequent analysis. The team may
perform data visualizations to help team members understand the data, including
its trends, outliers, and relationships among data variables. Each of these steps of
the data preparation phase is discussed throughout this section.
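As an illustration of the ETLT idea, the following minimal sketch loads raw data into a sandbox unchanged and then transforms it there, so the raw copy stays available for exploration. It assumes an in-memory SQLite database as a stand-in for the analytics sandbox; the table names, column names, and sample rows are hypothetical and are not taken from the text.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a source system.
RAW_CSV = """customer_id,signup_date,monthly_spend
1,2023-01-15,120.50
2,2023-02-01,
3,2023-02-20,87.00
"""


def load_raw(conn, raw_text):
    """Extract and load: copy the raw rows into the sandbox as text, unchanged."""
    conn.execute(
        "CREATE TABLE raw_customers "
        "(customer_id TEXT, signup_date TEXT, monthly_spend TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    conn.executemany(
        "INSERT INTO raw_customers "
        "VALUES (:customer_id, :signup_date, :monthly_spend)",
        rows,
    )
    return len(rows)


def transform_in_sandbox(conn):
    """Transform inside the sandbox, leaving the raw table intact."""
    conn.execute(
        """
        CREATE TABLE customers_clean AS
        SELECT CAST(customer_id AS INTEGER) AS customer_id,
               signup_date,
               -- An empty string becomes NULL so it is treated as missing.
               CAST(NULLIF(monthly_spend, '') AS REAL) AS monthly_spend
        FROM raw_customers
        """
    )


# An in-memory database stands in for the analytics sandbox (an assumption).
sandbox = sqlite3.connect(":memory:")
loaded = load_raw(sandbox, RAW_CSV)
transform_in_sandbox(sandbox)
clean_rows = sandbox.execute(
    "SELECT customer_id, monthly_spend FROM customers_clean ORDER BY customer_id"
).fetchall()
print(loaded, clean_rows)
```

Keeping both `raw_customers` and `customers_clean` in the sandbox lets the team compare the conditioned data against the original when a transformation looks suspect.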
Data preparation tends to be the most labor-intensive step in the analytics lifecycle.
In fact, it is common for teams to spend at least 50% of a data science project's time
in this critical phase. If the team cannot obtain enough data of sufficient quality, it
may be unable to perform the subsequent steps in the lifecycle process.
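One way to gauge whether the data is of sufficient quality is a quick profile of each column. The sketch below reports the share of missing values and flags outliers with the common 1.5 x IQR rule; that rule, the function name `profile_column`, and the sample values are assumptions chosen for illustration, not a method prescribed by the text.

```python
import statistics


def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / len(values)

    # Quartiles of the non-missing values (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Values outside the fences are candidates for review, not automatic errors.
    outliers = [v for v in present if v < low or v > high]
    return {
        "missing_rate": missing_rate,
        "median": statistics.median(present),
        "outliers": outliers,
    }


# Hypothetical sample column with one missing value and one extreme value.
monthly_spend = [120.5, None, 87.0, 95.0, 101.0, 110.0, 98.0, 2500.0]
report = profile_column(monthly_spend)
print(report)
```

A flagged value such as 2500.0 may be a data-entry error or a genuinely unusual customer; deciding which is part of becoming familiar with the data.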
Figure 2.4 shows an overview of the Data Analytics Lifecycle for Phase 2. The data
preparation phase is generally the most iterative and the one that teams tend to
underestimate most often. This is because most teams and leaders are anxious to
begin analyzing the data, testing hypotheses, and getting answers to some of the
questions posed in Phase 1. Many tend to jump into Phase 3 or Phase 4 to begin
rapidly developing models and algorithms without spending the time to prepare the
data for modeling. Consequently, teams come to realize that the data they are
working with does not allow them to execute the models they want, and they end up
back in Phase 2 anyway.