Tap outputTap = new Hfs(new TextDelimited(true, ","), output_dir);
Pipe outputPipe = new Pipe("output pipe", joinPipe);

// The Flow definition hooks it all together
FlowDef flowDef = FlowDef.flowDef()
    .addSource(salesPipe, websalesTap)
    .addSource(usersPipe, usersTap)
    .addTailSink(outputPipe, outputTap);
flowConnector.connect(flowDef).complete();
  }
}
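The flow above joins web-sales records with user records on a shared key and writes the result as comma-delimited text. As a rough illustration of what that join computes, here is an in-memory sketch in plain Java with no Cascading dependency; the field names and sample rows are hypothetical, not taken from the listing:

```java
import java.util.*;

public class SimpleJoinSketch {
    // Inner join of sales rows (user id, amount) against a user-id -> name map,
    // emitting comma-delimited lines much like the flow's TextDelimited sink
    static List<String> join(Map<String, String> users, List<String[]> sales) {
        List<String> out = new ArrayList<>();
        for (String[] sale : sales) {
            String name = users.get(sale[0]);
            if (name != null) { // rows with no matching user are dropped
                out.add(sale[0] + "," + name + "," + sale[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "Alice");
        users.put("u2", "Bob");
        List<String[]> sales = Arrays.asList(
            new String[]{"u1", "19.99"},
            new String[]{"u3", "7.50"}); // u3 has no user_info row
        for (String line : join(users, sales)) {
            System.out.println(line);
        }
    }
}
```

The real flow performs the same logical operation, but Cascading plans it as MapReduce jobs so the join scales beyond memory.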
Deploying a Cascading Application on a Hadoop Cluster
Once your application is working as expected on small amounts of local test data, you can deploy the same packaged JAR file to a Hadoop cluster with no code modifications. Remember to add the Cascading JAR files to your application's lib directory, and make sure your source data is available in HDFS. Then move your application to a node on your Hadoop cluster and launch it using the hadoop jar command.
Listing 9.6 shows an example of running a Cascading application on a Hadoop cluster.
Listing 9.6 Running your Cascading application on a Hadoop cluster
# Make sure your source data is available in HDFS
$> hadoop dfs -put websales.csv /user/hduser/websales.csv
$> hadoop dfs -put user_info.csv /user/hduser/user_info.csv
# Run the hadoop jar command
$> hadoop jar mycascading.jar /user/hduser/websales.csv \
/user/hduser/user_info.csv output_directory
INFO util.HadoopUtil: resolving application jar from found
main method on: CascadingSimpleJoinPipe
INFO planner.HadoopPlanner: using application jar:
/home/hduser/mycascading.jar
INFO property.AppProps: using app.id: 35FEB5D0590D62AFA6D496F3F17C14B9
INFO mapred.FileInputFormat: Total input paths to process : 1
# etc...
If you are relatively new to Hadoop, and Cascading is your introduction to writing custom JAR files for the framework, take a moment to appreciate what is happening behind the scenes of the hadoop jar command. A Hadoop cluster comprises a collection of services with specialized roles. Services known as JobTrackers are responsible for keeping track of individual tasks and sending them to services on other machines. TaskTrackers are the cluster's workers; these services accept tasks from the