Advanced Analytics – Paradigms, Tools, and Techniques - Getting Started with Greenplum for Big Data Analytics

Modeling methods

In the next few sections, we will cover the following important analytical methods in detail:

• Decision trees (classification)

• Association rules (unsupervised learning)

• Linear and logistic regression

• Naive Bayesian classifier (classification)

• K-means clustering (unsupervised learning)

• Text analysis

Decision trees

Decision trees are an example of a classification technique. Here, we classify data in a tree format using data features or attributes. Since decision trees depict the flows and the possible outcomes of each flow, they are used to identify the best strategy for reaching a goal.

In decision trees, we start by testing an attribute and splitting the data based on that attribute:

• We repeat this process on each resulting subset until a termination criterion is met.

• We can build multiple decision trees for the same problem.

• The efficiency and size of the tree depend directly on the attributes we choose to split on.

• We also need to have termination criteria:

• One obvious criterion is that all the records at the node belong to one

class and hence cannot be split.

• A significant majority of records belong to a single class (say, if 99 percent of the records are buyers, we can stop).

• The segment contains only one or a very small number of records.

• The improvement is not substantial enough to warrant making the split.

If we do not terminate at the right place, we might overfit the data.
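The termination criteria above can be sketched as a single check that runs before each split. This is a minimal illustration and not a Greenplum or MADlib API: the function names, the default thresholds (`majority_threshold`, `min_records`, `min_gain`), and the use of Gini impurity to measure improvement are all assumptions, since the text does not name a specific split measure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(rows, labels, attribute):
    """Reduction in Gini impurity from splitting on a categorical attribute.

    rows is a list of dicts; attribute is a key present in every row.
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    weighted = sum(len(group) / n * gini(group) for group in groups.values())
    return gini(labels) - weighted


def should_stop(labels, best_gain,
                majority_threshold=0.99, min_records=5, min_gain=0.01):
    """Return the reason to stop splitting this node, or None to continue.

    The thresholds are illustrative defaults, not recommended values.
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return "pure"                      # all records belong to one class
    majority = counts.most_common(1)[0][1]
    if majority / len(labels) >= majority_threshold:
        return "majority"                  # e.g. 99 percent are buyers
    if len(labels) <= min_records:
        return "too_few_records"           # segment is very small
    if best_gain < min_gain:
        return "insufficient_improvement"  # split is not worth making
    return None
```

Loosening these thresholds (for example, setting `min_records` to 1 and `min_gain` to 0) lets the tree keep splitting until every leaf is pure, which is the situation in which overfitting becomes likely.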

• We can read a decision tree as a set of rules. The conditions along each branch are joined with "and", and multiple branches that lead to the same outcome are joined with "or".
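To make the rule reading concrete, here is a small sketch that walks a tree and writes out one rule per class. The nested-dictionary representation, the function names, and the sample tree (with hypothetical `income` and `student` attributes) are assumptions made for illustration only.

```python
def tree_to_rules(node, conditions=()):
    """Yield (conditions, label) for every root-to-leaf path in the tree."""
    if "label" in node:  # leaf node: emit the path collected so far
        yield list(conditions), node["label"]
        return
    for value, child in node["branches"].items():
        step = (node["attribute"], value)
        yield from tree_to_rules(child, conditions + (step,))


def format_rules(tree):
    """Join conditions on a path with 'and', and paths per class with 'or'."""
    by_label = {}
    for conditions, label in tree_to_rules(tree):
        clause = " and ".join(f"{attr} = {val}" for attr, val in conditions)
        by_label.setdefault(label, []).append(f"({clause or 'true'})")
    return {label: " or ".join(clauses) for label, clauses in by_label.items()}


# Hypothetical tree: split on income first, then on student status.
tree = {
    "attribute": "income",
    "branches": {
        "high": {"label": "buyer"},
        "low": {
            "attribute": "student",
            "branches": {
                "yes": {"label": "buyer"},
                "no": {"label": "non-buyer"},
            },
        },
    },
}

rules = format_rules(tree)
for label, rule in rules.items():
    print(f"{label}: {rule}")
```

For this sample tree, the `buyer` class is reached by two branches, so its rule contains two "and" clauses joined by "or", while `non-buyer` is reached by a single branch.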
