Beyond MapReduce - Enterprise Data Workflows with Cascading


Initially, the focus of the Pattern project was entirely on model scoring:

1. Create a predictive model in an analytics framework.

2. Export the model as PMML.

3. Use Pattern to translate the PMML description into a parallelized algorithm, as a Cascading subassembly.

4. Run the model in parallel at scale on a Hadoop cluster.
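The scoring steps above can be sketched in miniature. The following is not Pattern's API: it is a toy, single-machine illustration in Python of what "translating a PMML description into an algorithm" means for a linear regression. The PMML fragment is hand-written and trimmed (it omits the Header, DataDictionary, and MiningSchema elements a real export would carry), and the names `PMML_DOC`, `load_scorer`, and `score` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed PMML fragment for a two-variable linear regression.
PMML_DOC = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_scorer(pmml_text):
    """Turn the RegressionTable into a plain Python scoring function."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }

    def score(record):
        # Linear model: intercept plus the weighted sum of the input fields.
        return intercept + sum(c * record[name] for name, c in coefficients.items())

    return score


score = load_scorer(PMML_DOC)
records = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 4.0}]

# Each record is scored independently of the others, which is the property
# that lets a framework distribute this step across a cluster.
predictions = [score(r) for r in records]
print(predictions)
```

Because the scoring function depends only on the model parameters and one input record, the same logic can in principle be applied to partitions of a large data set in parallel, which is the role the Cascading subassembly plays in step 3.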

More recently, the project has begun work on model creation, where models are trained at scale on Hadoop clusters and saved as PMML. Training at scale can leverage other libraries based on Cascading, such as the Matrix API for Scalding. The resulting model can then be run at scale using the model scoring features.
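As a rough illustration of the train-then-export idea, the sketch below fits a one-variable linear regression by ordinary least squares and serializes the result as a trimmed PMML fragment. It runs on a single machine with made-up data and does not use Pattern or Hadoop; the function names `fit_line` and `to_pmml` are hypothetical, and a real export would also include Header, DataDictionary, and MiningSchema elements.

```python
import xml.etree.ElementTree as ET


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def to_pmml(intercept, slope, field="x"):
    """Serialize the fitted parameters as a minimal PMML regression fragment."""
    root = ET.Element(
        "PMML", {"version": "4.1", "xmlns": "http://www.dmg.org/PMML-4_1"}
    )
    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    table = ET.SubElement(model, "RegressionTable", {"intercept": repr(intercept)})
    ET.SubElement(
        table, "NumericPredictor", {"name": field, "coefficient": repr(slope)}
    )
    return ET.tostring(root, encoding="unicode")


# Made-up training data that lies exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(xs, ys)
pmml = to_pmml(intercept, slope)
print(pmml)
```

The point of saving the model as PMML rather than in a framework-specific format is that the training and scoring stages stay decoupled: any tool that reads PMML can consume the output.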

Of course, there are many commercial analytics frameworks used for predictive modeling. Popular tools include SAS, SAP's HANA, Oracle's Exalytics, MicroStrategy, Microsoft SQL Server, and Teradata, plus a variety of offerings from IBM such as SPSS. What these products all share is that they are expensive to license for large-scale apps.

There are Java translators for SAS, such as Carolina. Enterprise organizations typically look to migrate analytics workloads off licensed frameworks and onto Hadoop clusters because of the potential for enormous cost savings. However, that migration implies the cost of rewriting and validating models in Java, Hive, Pig, etc.

In terms of Hadoop specifically, there are very good machine learning libraries available, such as Apache Mahout or the Mallet toolkit from UMass. However, these are tightly coupled to Apache Hadoop. They are not designed to integrate with other data frameworks and topologies, let alone leverage the Cascading flow planner.

Pattern implements large-scale, distributed algorithms in the context of Cascading as a pattern language:

• In contrast with R, it emphasizes test-driven development (TDD) at scale, with more standardized failure modes.

• In contrast with SAS, it is open sourced under the Apache License 2.0, and its algorithms run efficiently in parallel on large-scale clusters.

• In contrast with Mahout, it implements predictive models that can leverage resources beyond Hadoop while complying with best practices for Enterprise IT.

Getting Started with Pattern

Change to a directory on your computer where you have a few gigabytes of available disk space, and then clone the source code repo from GitHub:

$ git clone https://github.com/Cascading/pattern.git
