Beyond MapReduce - Enterprise Data Workflows with Cascading

strategy in commercial applications of machine learning. Moreover, PMML combines these definitions into an expression of business process for a complex data workflow. Overall, that maps to Cascading quite closely: input and output variables in PMML correspond to tuple flows, with the Cascading flow planners providing parallelization for predictive model algorithms on Hadoop clusters.
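To make the variable-to-field correspondence concrete, here is a minimal sketch in Python that reads the declared variables out of a PMML document and separates model inputs from model outputs. The PMML fragment is hand-written for illustration and trimmed well below a schema-valid document; the field names and the `field_roles` helper are hypothetical and are not part of Pattern or Cascading.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_1"
NS = {"pmml": PMML_NS}

# Hand-written, trimmed PMML fragment (assumption: illustrative only, it
# omits required elements such as Header and the regression tables).
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="classification">
    <MiningSchema>
      <MiningField name="sepal_length" usageType="active"/>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="species" usageType="predicted"/>
    </MiningSchema>
  </RegressionModel>
</PMML>
"""


def field_roles(pmml_text):
    """Return (inputs, outputs) as lists of (field name, data type) pairs."""
    root = ET.fromstring(pmml_text)
    # The DataDictionary declares every variable and its data type.
    types = {
        f.get("name"): f.get("dataType")
        for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    }
    inputs, outputs = [], []
    # The MiningSchema states how the model uses each variable.
    for mf in root.iter("{%s}MiningField" % PMML_NS):
        name = mf.get("name")
        usage = mf.get("usageType", "active")  # "active" is the PMML default
        if usage == "active":
            inputs.append((name, types.get(name)))
        elif usage in ("predicted", "target"):
            outputs.append((name, types.get(name)))
    return inputs, outputs


inputs, outputs = field_roles(PMML_DOC)
```

In this sketch, `inputs` would play the role of the fields arriving in an incoming tuple stream and `outputs` the fields appended by the model, which is the sense in which PMML variables line up with tuple flows.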

Currently there are several companies collaborating on the Pattern project. Besides the Random Forest and Logistic Regression algorithms, other PMML implementations include the following:

• Linear Regression

• K-Means Clustering

• Hierarchical Clustering

• Support Vector Machines

Linear regression is probably the most common form of predictive model; it is available, for example, in Microsoft Excel spreadsheets. K-means is widely used for customer segmentation, document search, and other kinds of predictive models.
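As a rough illustration of what scoring with these two model types involves, the sketch below applies a linear regression (an intercept plus one coefficient per input) and a k-means model (assignment to the nearest cluster center). The function names and all numbers are made up for the example; this is plain Python, not the Pattern implementation.

```python
def score_linear(intercept, coefficients, row):
    """Predicted value = intercept + sum of coefficient * input value."""
    return intercept + sum(c * row[name] for name, c in coefficients.items())


def nearest_cluster(centers, point):
    """Return the index of the center closest to point (squared Euclidean)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, point))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


# Hypothetical model parameters and input records.
price = score_linear(10.0, {"rooms": 2.0, "age": -0.5}, {"rooms": 3, "age": 4})
segment = nearest_cluster([(0.0, 0.0), (10.0, 10.0)], (8.0, 9.0))
```

Here `price` works out to 10 + 2·3 − 0.5·4 = 14.0, and `segment` is 1 because the point (8, 9) lies closer to the second center. Both functions score one record at a time, which is why such models parallelize naturally across a tuple stream.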

Other good PMML resources include the following:

• Data Mining Group: XML standards and supported vendors

• Zementis PMML validator

• PMML group on LinkedIn

• “Representing predictive solutions in PMML” by Alex Guazzelli

Books Related to Pattern

For more information about PMML and predictive models in general, check out these books:

• PMML in Action by Alex Guazzelli, Wen-Ching Lin, and Tridivesh Jena (CreateSpace)

• Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman (Cambridge University Press)
