Initially, the focus of the Pattern project was entirely on model scoring:
1. Create a predictive model in an analytics framework.
2. Export the model as PMML.
3. Use Pattern to translate the PMML description into a parallelized algorithm, as a
Cascading subassembly.
4. Run the model in parallel at scale on a Hadoop cluster.
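To make steps 2 and 3 more concrete, here is a minimal sketch in Python of what "translating a PMML description into an algorithm" means at toy scale. It is not Pattern's implementation: Pattern generates Cascading subassemblies in Java and runs them in parallel. The PMML document below is hand-written with made-up coefficients, and the helper names load_linear_model and score are hypothetical. The sketch only assumes the standard PMML 4.1 RegressionModel layout (a RegressionTable with an intercept and NumericPredictor elements).

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal PMML 4.1 linear regression.
# The field names and coefficients are illustrative assumptions, not real model output.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Parse the PMML text and return (intercept, {field name: coefficient})."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    if table is None:
        raise ValueError("no RegressionModel/RegressionTable found in PMML")
    intercept = float(table.get("intercept", "0"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (a dict of field name to value)."""
    intercept, coefficients = model
    return intercept + sum(coef * record[name] for name, coef in coefficients.items())


if __name__ == "__main__":
    model = load_linear_model(PMML_DOC)
    print(score(model, {"x1": 3.0, "x2": 4.0}))
```

Because each record is scored independently of every other record, the scoring function can be applied to partitions of a data set in parallel, which is the property step 4 relies on.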
More recently, the project has begun work on model creation, where models get trained
at scale using Hadoop clusters and saved as PMML. Training at scale can leverage other
libraries based on Cascading, such as the Matrix API for Scalding. Then the model can
be run at scale using the model scoring features.
Of course, there are many commercial analytics frameworks used for predictive modeling.
Popular tools include SAS, SAP HANA, Oracle Exalytics, MicroStrategy, Microsoft
SQL Server, Teradata, plus a variety of offerings from IBM such as SPSS. What these
products all share is that they are expensive to license for large-scale apps.
There are Java translators for SAS, such as Carolina. Enterprise organizations typically
look to migrate analytics workloads off of licensed frameworks and onto Hadoop clusters
because of the potential for enormous cost savings. However, that migration implies
the cost of rewriting and validating models in Java, Hive, Pig, etc.
In terms of Hadoop specifically, there are very good machine learning libraries available,
such as Apache Mahout or the Mallet toolkit from UMass. However, these are tightly
coupled to Apache Hadoop. They are not designed to integrate with other data frameworks
and topologies, let alone leverage the Cascading flow planner.
Pattern implements large-scale, distributed algorithms in the context of Cascading as a
pattern language:
• In contrast with R, it emphasizes test-driven development (TDD) at scale, with
more standardized failure modes.
• In contrast with SAS, it is open source under the Apache License 2.0, and its
algorithms run efficiently in parallel on large-scale clusters.
• In contrast with Mahout, it implements predictive models that can leverage resources
beyond Hadoop while complying with best practices for Enterprise IT.
Getting Started with Pattern
Change to a directory on your computer where you have a few gigabytes of available
disk space, and then clone the source code repo from GitHub:
$ git clone https://github.com/Cascading/pattern.git