Advanced Analytics – Paradigms, Tools, and Techniques - Getting Started with Greenplum for Big Data Analytics

Since twice as many good customers (green) as bad customers (red) are plotted,
it is reasonable to believe that a new customer who has not yet been evaluated is
twice as likely to be green as red. In Bayesian analysis, this belief is known as
the prior probability. Prior probabilities are based on previous experience; in the
current example, they are based on the proportions of green and red points plotted.
As the word "prior" indicates, they are often used to predict outcomes before they
actually happen. Let us now assume there is a total of 60 customers, with 40 of them
classified as good and 20 of them as bad. Our prior probabilities for class
membership are:

• Prior probability of good customers: number of good customers (40) / total
number of customers (60) = 2/3 ≈ 0.67

• Prior probability of bad customers: number of bad customers (20) / total
number of customers (60) = 1/3 ≈ 0.33
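
The two priors above are simple ratios, so a short Python sketch can reproduce them. The counts come from the text; the variable names are illustrative only.

```python
from fractions import Fraction

# Class counts taken from the example in the text.
counts = {"good": 40, "bad": 20}
total = sum(counts.values())  # 60 customers in all

# Prior probability of a class = class count / total count.
priors = {label: Fraction(n, total) for label, n in counts.items()}

for label, p in priors.items():
    print(f"P({label}) = {p} (about {float(p):.2f})")
```

Using exact fractions keeps the 2:1 ratio between the classes visible instead of hiding it behind rounded decimals.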

Having formulated our prior probabilities, we are now ready to classify a new customer
(the white circle in the following figure). We first count the points in the circle
that belong to each class label. From these counts we calculate the likelihood of the
new customer being marked as good or bad.
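
As a rough sketch of this step, the Python below turns neighbourhood counts into likelihoods. The counts inside the circle (1 good, 3 bad) are hypothetical, since the figure's actual values are not reproduced here. The final step, multiplying prior by likelihood to score each class, is the standard Bayes' rule combination and is an assumption about where the example is heading rather than something stated above.

```python
from fractions import Fraction

class_totals = {"good": 40, "bad": 20}  # from the text
in_circle = {"good": 1, "bad": 3}       # hypothetical counts inside the circle

total = sum(class_totals.values())
priors = {c: Fraction(n, total) for c, n in class_totals.items()}

# Likelihood of the new point given a class:
# points of that class inside the circle / all points of that class.
likelihoods = {c: Fraction(in_circle[c], class_totals[c]) for c in class_totals}

# Unnormalised posterior score = prior * likelihood
# (Bayes' rule without the denominator shared by both classes).
scores = {c: priors[c] * likelihoods[c] for c in class_totals}

# The class with the larger score is the predicted label.
predicted = max(scores, key=scores.get)

for c in class_totals:
    print(f"{c}: likelihood = {likelihoods[c]}, score = {scores[c]}")
print("predicted class:", predicted)
```

With these assumed counts, the prior favours "good" but the neighbourhood evidence is strong enough to tip the prediction the other way, which is the point of combining the two quantities.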

K-means clustering

The K-means clustering algorithm is considered one of the simplest unsupervised
learning techniques. As a first step, the given data is partitioned into a fixed
number, k, of clusters. Each cluster has its own centroid, and the centroids are
placed carefully, as far away from each other as possible. As a next step, each data
point is associated with its nearest centroid. This is repeated until every point has
been assigned.
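
The assignment step above, together with the centroid update and stopping rule described in the text that follows, can be sketched in plain Python. This is a minimal illustration, not the implementation Greenplum uses. The function names `assign`, `update`, and `kmeans`, the `max_iter` and `seed` parameters, and the sample points are all hypothetical. The sketch also picks starting centroids at random from the data instead of spreading them far apart.

```python
import math
import random

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)  # leave an empty cluster's centroid in place
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # simplification: random starting centroids
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # no movement: converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical sample data: two small groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Keeping `assign` and `update` as separate functions mirrors the two alternating steps of the algorithm and makes each one easy to check on its own.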

Based on these associations, new centroids are identified. The preceding exercise is
repeated until no further changes or movements in the centroids are identified. Fin-
