Data Analytics Lifecycle - Data Science and Big Data Analytics

preparatory work to make the subsequent phases of model selection and execution

easier and more efficient. A common way to conduct this step involves using tools

to perform data visualizations. Approaching the data exploration in this way aids

the team in previewing the data and assessing relationships between variables at a

high level.

In many cases, stakeholders and subject matter experts have instincts and hunches

about what the data science team should be considering and analyzing. Likely,

this group had some hypothesis that led to the genesis of the project. Often,

stakeholders have a good grasp of the problem and domain, although they may not

be aware of the subtleties within the data or the model needed to accept or reject

a hypothesis. Other times, stakeholders may be correct, but for the wrong reasons

(for instance, they may be correct about a correlation that exists but infer an

incorrect reason for the correlation). Meanwhile, data scientists have to approach

problems with an unbiased mind-set and be ready to question all assumptions.

As the team begins to question the incoming assumptions and test initial ideas of

the project sponsors and stakeholders, it needs to consider the inputs and data

that will be needed, and then it must examine whether these inputs are actually

correlated with the outcomes that the team plans to predict or analyze. Some

methods and types of models will handle correlated variables better than others.

Depending on what the team is attempting to solve, it may need to consider an

alternate method, reduce the number of data inputs, or transform the inputs to

allow the team to use the best method for a given business problem. Some of these

techniques will be explored further in Chapter 3 and Chapter 6.
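
As a rough illustration of checking whether candidate inputs are actually correlated with the outcome, the sketch below computes a Pearson correlation coefficient for each input. It is a minimal example under stated assumptions, not a method prescribed by the text: the function names (pearson, screen_inputs), the variable names, and the toy data are all hypothetical, and Pearson correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def screen_inputs(inputs, outcome):
    """Map each candidate input name to its correlation with the outcome."""
    return {name: pearson(values, outcome) for name, values in inputs.items()}


# Hypothetical toy data, invented purely for illustration.
inputs = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_visits": [2, 4, 7, 8, 10],
    "day_of_month": [3, 1, 4, 1, 5],
}
outcome = [2.1, 3.9, 6.2, 8.1, 9.8]

correlations = screen_inputs(inputs, outcome)
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

On this toy data, ad_spend and store_visits track the outcome closely while day_of_month does not. A screen like this is only a first pass: a weak linear correlation does not rule out a nonlinear relationship, and a strong one says nothing about the reason behind it.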

The key to this approach is to aim for capturing the most essential predictors

and variables rather than considering every possible variable that people think

may influence the outcome. Approaching the problem in this manner requires

iterations and testing to identify the most essential variables for the intended

analyses. The team should plan to test a range of variables to include in the model

and then focus on the most important and influential variables.
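
One possible way to narrow a range of candidate variables down to a few influential ones is a greedy screen: rank inputs by the strength of their correlation with the outcome, then skip any input that is nearly a copy of one already kept. The sketch below is a simple heuristic shown for illustration only; the function select_predictors, its parameters (max_inputs, redundancy_cutoff), the cutoff value, and the toy data are assumptions, not something taken from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_predictors(inputs, outcome, max_inputs=2, redundancy_cutoff=0.9):
    """Greedily pick inputs that relate to the outcome but not to each other.

    Inputs are visited from strongest to weakest absolute correlation with
    the outcome. A candidate is skipped when its absolute correlation with
    an already selected input reaches redundancy_cutoff (an assumed value).
    """
    ranked = sorted(
        inputs,
        key=lambda name: abs(pearson(inputs[name], outcome)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if len(selected) == max_inputs:
            break
        redundant = any(
            abs(pearson(inputs[name], inputs[kept])) >= redundancy_cutoff
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical toy data, invented purely for illustration.
inputs = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_visits": [2, 4, 7, 8, 10],  # moves almost in lockstep with ad_spend
    "day_of_month": [3, 1, 4, 1, 5],
}
outcome = [2.1, 3.9, 6.2, 8.1, 9.8]

selected = select_predictors(inputs, outcome)
print(selected)
```

Here store_visits is passed over because it is almost perfectly correlated with ad_spend, which was already selected. In practice the team would iterate on the cutoff and the number of inputs, and would confirm the choice by testing the resulting model rather than relying on pairwise correlations alone.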

If the team plans to run regression analyses, it should identify the candidate predictors and

outcome variables of the model. The team should plan to create variables that determine outcomes

and demonstrate a strong relationship to the outcome rather than to the other input

variables. This includes remaining vigilant for problems such as serial correlation,

multicollinearity, and other typical data modeling challenges that interfere with

the validity of these models. Sometimes these issues can be avoided simply by

looking at ways to reframe a given problem. In addition, sometimes determining

correlation is all that is needed (“black box prediction”), and in other cases, the
