preparatory work to make the subsequent phases of model selection and execution
easier and more efficient. A common way to conduct this step involves using tools
to perform data visualizations. Approaching the data exploration in this way aids
the team in previewing the data and assessing relationships between variables at a
high level.
In many cases, stakeholders and subject matter experts have instincts and hunches
about what the data science team should be considering and analyzing. This group
likely had a hypothesis that led to the genesis of the project. Often,
stakeholders have a good grasp of the problem and domain, although they may not
be aware of the subtleties within the data or the model needed to accept or reject
a hypothesis. Other times, stakeholders may be correct, but for the wrong reasons
(for instance, they may be correct about a correlation that exists but infer an
incorrect reason for the correlation). Meanwhile, data scientists have to approach
problems with an unbiased mind-set and be ready to question all assumptions.
As the team begins to question the incoming assumptions and test initial ideas of
the project sponsors and stakeholders, it needs to consider the inputs and data
that will be needed, and then it must examine whether these inputs are actually
correlated with the outcomes that the team plans to predict or analyze. Some
methods and types of models will handle correlated variables better than others.
Depending on what the team is attempting to solve, it may need to consider an
alternate method, reduce the number of data inputs, or transform the inputs to
allow the team to use the best method for a given business problem. Some of these
techniques will be explored further in Chapter 3 and Chapter 6.
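As a minimal sketch of the check described above, the following Python computes the Pearson correlation of each candidate input with the outcome, and of each pair of inputs with one another. The variable names (ad_spend, store_visits, day_of_month) and the toy values are hypothetical and serve only as illustration. Pearson correlation is one simple choice here and captures only linear relationships.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three candidate inputs and one outcome.
inputs = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_visits": [2.1, 3.9, 6.2, 8.0, 9.8],
    "day_of_month": [3.0, 1.0, 4.0, 1.0, 5.0],
}
outcome = [10.0, 19.0, 31.0, 42.0, 50.0]

# Is each input actually correlated with the outcome?
input_vs_outcome = {
    name: pearson(values, outcome) for name, values in inputs.items()
}

# Are any inputs strongly correlated with each other?
input_vs_input = {
    (a, b): pearson(inputs[a], inputs[b]) for a, b in combinations(inputs, 2)
}

for name, r in input_vs_outcome.items():
    print(f"{name} vs outcome: r = {r:.3f}")
for (a, b), r in input_vs_input.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

In this toy data, ad_spend and store_visits both track the outcome closely but also track each other, which is the kind of situation where the team might drop or transform one input, or choose a method that tolerates correlated variables.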
The key to this approach is to capture the most essential predictors
and variables rather than considering every possible variable that people think
may influence the outcome. Approaching the problem in this manner requires
iterations and testing to identify the most essential variables for the intended
analyses. The team should plan to test a range of variables to include in the model
and then focus on the most important and influential variables.
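One way to illustrate this narrowing-down is a simple correlation-based screen, sketched below in Python. It is not the method prescribed by the text, only a hedged example: the function screen_variables, its thresholds (0.5 and 0.9), the variable names, and the toy housing data are all assumptions made for illustration. The screen ranks candidates by the strength of their correlation with the outcome, then keeps a candidate only if it is not largely redundant with one already kept.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def screen_variables(candidates, outcome, min_outcome_corr=0.5, max_input_corr=0.9):
    """Greedy screen (hypothetical helper): strongest outcome correlation first,
    skipping candidates that are highly correlated with one already selected."""
    strength = {
        name: abs(pearson(values, outcome)) for name, values in candidates.items()
    }
    ranked = sorted(candidates, key=lambda name: strength[name], reverse=True)
    selected = []
    for name in ranked:
        if strength[name] < min_outcome_corr:
            break  # every remaining candidate is weaker still
        redundant = any(
            abs(pearson(candidates[name], candidates[kept])) > max_input_corr
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical toy data: four candidate inputs and a price outcome.
candidates = {
    "floor_area_m2": [50, 60, 80, 100, 120, 150],
    "num_rooms": [2, 2, 3, 4, 4, 6],
    "age_years": [30, 5, 40, 10, 20, 2],
    "house_number": [12, 7, 3, 15, 9, 11],
}
price = [72, 113, 121, 188, 223, 297]

selected = screen_variables(candidates, price)
print(selected)
```

A screen like this is only a starting point for iteration. The thresholds are arbitrary, and a variable with a weak linear correlation may still matter in a nonlinear model, so the team would still test the shortlisted variables in the model itself.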
If the team plans to run regression analyses, it should identify the candidate
predictors and outcome variables of the model. Plan to create variables that help
determine the outcome and that demonstrate a strong relationship to the outcome
rather than to the other input variables. This includes remaining vigilant for
problems such as serial correlation,
multicollinearity, and other typical data modeling challenges that interfere with
the validity of these models. Sometimes these issues can be avoided simply by
looking at ways to reframe a given problem. In addition, sometimes determining
correlation is all that is needed (“black box prediction”), and in other cases, the