objective of the project is to understand the causal relationship better. In the latter
case, the team wants the model to have explanatory power and needs to forecast or
stress test the model under a variety of situations and with different datasets.
2.4.2 Model Selection
In the model selection subphase, the team's main goal is to choose an analytical
technique, or a short list of candidate techniques, based on the end goal of the
project. Here, the term model is used in a general sense: it simply refers to
an abstraction from reality. One observes events
happening in a real-world situation or with live data and attempts to construct
models that emulate this behavior with a set of rules and conditions. In the case
of machine learning and data mining, these rules and conditions are grouped into
several general sets of techniques, such as classification, association rules, and
clustering. By reviewing these types of potential models, the team can
winnow the list down to several viable candidates for a given problem.
More details on matching the right models to common types of business problems
are provided in Chapter 3 and Chapter 4, “Advanced Analytical Theory and
Methods: Clustering.”
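The winnowing step can be pictured as a simple filter over candidate technique families. The following is a minimal sketch, not a method prescribed by the text: the Technique class, the goal tags, and the shortlist function are hypothetical names introduced for illustration, and the only domain knowledge assumed is that classification needs a labeled outcome while clustering and association rules do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Technique:
    name: str
    needs_labels: bool  # True if training requires a known outcome variable
    goal: str           # hypothetical tag describing what the technique is for


# The three technique families named in the text; the goal tags are assumptions.
CANDIDATES = [
    Technique("classification", True, "predict_category"),
    Technique("clustering", False, "find_groups"),
    Technique("association rules", False, "find_cooccurrence"),
]


def shortlist(goal, has_labels):
    """Return the names of techniques that match the goal and the data on hand."""
    return [
        t.name
        for t in CANDIDATES
        if t.goal == goal and (has_labels or not t.needs_labels)
    ]
```

For example, a team that wants to group customers and has no labeled outcome would call shortlist("find_groups", has_labels=False). A real shortlist would also weigh data volume, interpretability, and the modeling assumptions the team is prepared to defend.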
An additional consideration when dealing with Big Data is whether the team
will use techniques that are best suited for structured
data, unstructured data, or a hybrid approach. For instance, the team can leverage
MapReduce to analyze unstructured data, as highlighted in Chapter 10. Lastly, the
team should take care to identify and document the modeling assumptions it is
making as it chooses and constructs preliminary models.
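To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to unstructured text (a word count). It only imitates the programming model in plain Python and is not a distributed implementation; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical.

```python
import re
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one text document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over an iterable of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real MapReduce system the map and reduce calls would run in parallel on many machines and the framework would handle the shuffle; the sketch keeps only the data flow.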
Typically, teams create the initial models using a statistical software package such
as R, SAS, or MATLAB. Although these tools are designed for data mining and
machine learning algorithms, they may have limitations when the models are applied
to very large datasets, as is common with Big Data. As such, the team may consider
redesigning these algorithms to run in the database itself during the pilot phase
described in Phase 6.
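As one illustration of what running an algorithm in the database can mean, the sketch below fits a simple least-squares line by letting the database compute the required sums, so that only five aggregate values leave the database instead of every row. It is a minimal sketch under stated assumptions: SQLite (via Python's standard sqlite3 module) stands in for a production database, and the function name fit_line_in_database and its parameters are hypothetical.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = slope * x + intercept using SQL aggregates computed in the database.

    Assumption: table and column names come from trusted code, because they
    are interpolated directly into the SQL text.
    """
    query = f"""
        SELECT COUNT(*),
               SUM({x_col}),
               SUM({y_col}),
               SUM({x_col} * {x_col}),
               SUM({x_col} * {y_col})
        FROM {table}
    """
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    if n < 2:
        raise ValueError("need at least two rows to fit a line")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

The design choice here is to push the aggregation to where the data lives and keep only the final arithmetic in the client. More complex models need more elaborate reformulations, which is why the text defers this work to the pilot phase.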
The team can move to the model building phase once it has a good idea of
the type of model to try and has gained enough knowledge to refine the
analytics plan. Advancing from this phase requires a general methodology for the
analytical model, a solid understanding of the variables and techniques to use, and
a description or diagram of the analytic workflow.