Building Data Transformation Workflows with Pig and Cascading - Data Just Right: Introduction to Large-Scale Data and Analytics

public class CascadingCopyPipe {
  public static void main(String[] args) {
    // Input and output paths are passed on the command line
    String input = args[0];
    String output = args[1];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CascadingCopyPipe.class);
    HadoopFlowConnector flowConnector =
        new HadoopFlowConnector(properties);

    // Source and sink taps: comma-delimited text files with a header row
    Tap inTap = new Hfs(new TextDelimited(true, ","), input);
    Tap outTap = new Hfs(new TextDelimited(true, ","), output);

    Pipe copyPipe = new Pipe("Copy Pipeline");

    // Wire the source and sink to the pipe, then run the flow
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);
    flowConnector.connect(flowDef).complete();
  }
}

Creating a Cascade: A Simple JOIN Example

To build more complex workflows, we need to be able to connect various pipes of data
together. Cascading's abstraction model provides several types of pipe assemblies that can
be used to filter, group, and join data streams.

Just as with water pipes, it is possible for Cascading to split data streams into two
separate pipes, or merge various pipes of data into one. In order to connect two data
streams into one, we can chain pipes together using a Cascading CoGroup process.
Using a CoGroup, we can merge two streams using either a common key or a set
of common tuple values. This operation should be familiar to users of SQL, as it is
very similar to a simple JOIN. Cascading CoGroup flows require that the values being
grouped together have the same type. Listing 9.5 presents a CoGroup example.
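
Before turning to the Cascading listing, it may help to see the join semantics on their own. The following is a minimal sketch in plain Java, not the Cascading API: it only illustrates what an inner join on a common key produces. The class name InnerJoinSketch, the method innerJoin, and the sample rows are all hypothetical and chosen for illustration. Both streams use String keys, mirroring the requirement that grouped values share a type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of inner-join semantics; this is not Cascading code.
class InnerJoinSketch {

    // Joins two lists of {key, value} rows on the key.
    // Only keys present in both inputs appear in the result.
    public static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        // Group the right-hand values by key so each left row can look up its matches.
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (String[] row : right) {
            rightByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        List<String> joined = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = rightByKey.get(row[0]);
            if (matches == null) {
                continue; // no matching key on the right: the row is dropped
            }
            // Emit one output row per matching pair.
            for (String rightValue : matches) {
                joined.add(row[0] + "," + row[1] + "," + rightValue);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Sample data, invented for this sketch.
        List<String[]> users = List.of(
            new String[] {"1", "alice"},
            new String[] {"2", "bob"});
        List<String[]> orders = List.of(
            new String[] {"1", "book"},
            new String[] {"1", "pen"},
            new String[] {"3", "lamp"});

        for (String line : innerJoin(users, orders)) {
            System.out.println(line);
        }
    }
}
```

With the sample data, only key 1 appears in both inputs, so the result contains two rows for alice; bob and the lamp order are dropped. In Cascading, the grouping and matching would presumably be handled by the CoGroup pipe and its joiner rather than by hand-written loops like these.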

Listing 9.5 Using CoGroup() to join two streams on a value

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoin;
