Distributed Data Processing with Cascalog - Clojure Data Analysis

Getting ready

The previous recipes in this chapter used the version of Hadoop that Leiningen downloaded
as one of Cascalog's dependencies. For this recipe, however, we'll need to have Hadoop
installed and running separately. Go to http://hadoop.apache.org/ and download
and install it. You might also be able to use your operating system's package manager.
Alternatively, Cloudera has a VM with a 1-node Hadoop cluster that you can download and use
(…CDH4PackagesandDownloads).

You'll still need to configure everything. Take a look at the Hadoop website for the Getting
Started documentation for your version, and get a single-node setup working.

Once it's installed and configured, go ahead and start the servers. There's a script in the
bin directory to do this:

$ ./bin/start-dfs.sh
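
Before going further, it can be worth confirming that the servers actually came up. The following is a minimal sketch, assuming the JDK's jps tool is on the path and that a single-node HDFS setup runs daemons named NameNode and DataNode. The check_daemons helper is a hypothetical name used only for illustration; it is not part of Hadoop.

```shell
#!/bin/sh
# check_daemons (hypothetical helper): reads a jps-style process listing on
# stdin and reports which of the expected HDFS daemons are missing.
# Assumption: a single-node setup runs both NameNode and DataNode.
check_daemons() {
    listing=$(cat)
    missing=0
    for daemon in NameNode DataNode; do
        # -w matches whole words only, so SecondaryNameNode does not
        # count as NameNode.
        if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return $missing
}

# Example usage:
#   jps | check_daemons && echo "HDFS daemons are running"
```

If a daemon is reported as missing, the log files that Hadoop writes for that daemon are usually the first place to look.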

We still need to have everything working with Clojure, however. For this, we just use the
same dependencies and references as we did in the Initializing Cascalog and Hadoop for
distributed processing recipe. However, this time, don't worry about the REPL. We'll take
care of that separately.

For data, we'll use a dataset of U.S. domestic flights from 1990 to 2009. You can
download this dataset yourself from Infochimps at
http://www.ericrochester.com/clj-data-analysis/data/flights_with_colnames.csv.gz.
I've unzipped it into the data directory.
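
Unpacking the download can be sketched as follows. This assumes gunzip is installed (and curl, for the optional download step shown in the comments). The unpack_dataset helper is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# unpack_dataset (hypothetical helper): unpack a gzipped file into a
# destination directory, leaving the original archive in place.
unpack_dataset() {
    archive="$1"     # path to the .gz file
    dest_dir="$2"    # directory to unpack into, for example: data
    mkdir -p "$dest_dir" || return 1
    # gunzip -c writes the decompressed data to stdout, so the archive
    # itself is not removed.
    gunzip -c "$archive" > "$dest_dir/$(basename "$archive" .gz)"
}

# Example usage (the download step assumes curl is available):
#   curl -L -o flights_with_colnames.csv.gz \
#     http://www.ericrochester.com/clj-data-analysis/data/flights_with_colnames.csv.gz
#   unpack_dataset flights_with_colnames.csv.gz data
```

Keeping the compressed archive around means the download doesn't have to be repeated if the unpacked copy is later modified or deleted.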

How to do it…

For this recipe, we'll insert a file into the distributed file system, run the Clojure REPL inside
Hadoop, and read the data back out.

1. First, the data file must be in HDFS. We'll use the data/16285/flights_with_colnames.csv
file. We can insert it into HDFS with this command:

$ hadoop fs -put \
    data/16285/flights_with_colnames.csv \
    flights_with_colnames.csv

2. Now, in order to run our code in the Hadoop environment, we have to use the hadoop
command on a JAR file created from our project. Create an empty namespace to
give the JAR file a little content. For example, I created a file named
src/distrib_data/cascalog_setup.clj with this content:

(ns distrib-data.cascalog-setup
  (:require [cascalog.logic.ops :as c]
            [clojure.string :as string])
  (:use cascalog.api))
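
Before building the JAR, it may help to confirm that the upload from step 1 succeeded. The sketch below assumes the hadoop command is on the path and relies on its fs -test -e and fs -cat subcommands. The verify_upload helper is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# verify_upload (hypothetical helper): check that a path exists in HDFS and
# print its first few lines.
verify_upload() {
    hdfs_path="$1"
    # hadoop fs -test -e exits with status 0 when the path exists.
    if ! hadoop fs -test -e "$hdfs_path"; then
        echo "not found in HDFS: $hdfs_path" >&2
        return 1
    fi
    # Stream the file out of HDFS and show only the header and two rows.
    hadoop fs -cat "$hdfs_path" | head -n 3
}

# Example usage:
#   verify_upload flights_with_colnames.csv
```

Seeing the column-name header as the first line is a quick sign that the right file went in.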
