Getting ready
The previous recipes in this chapter used the version of Hadoop that Leiningen downloaded
as one of Cascalog's dependencies. For this recipe, however, we'll need to have Hadoop
installed and running separately. Go to http://hadoop.apache.org/ and download
and install it. You might also be able to use your operating system's package manager.
Alternatively, Cloudera has a VM with a 1-node Hadoop cluster that you can download and use
(https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH4PackagesandDownloads).
You'll still need to configure everything. Take a look at the Hadoop website for the Getting
Started documentation for your version, and get a single-node setup working.
Once it's installed and configured, go ahead and start the servers. There's a script in the
bin directory to do this:
$ ./bin/start-dfs.sh
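Before going further, it can help to confirm that the HDFS daemons actually came up. The following is a minimal sketch, not part of the original recipe: it assumes the JDK's jps tool is on your PATH, and the function name check_hdfs_daemons is made up for illustration.

```shell
# Hypothetical helper: report whether HDFS daemons appear in the JVM process list.
# Assumes the JDK's jps tool is installed; it lists running Java processes.
check_hdfs_daemons() {
  if command -v jps >/dev/null 2>&1; then
    # A single-node setup normally shows both a NameNode and a DataNode.
    jps | grep -E 'NameNode|DataNode' || echo "no HDFS daemons found"
  else
    echo "jps not found; is a JDK installed?"
  fi
}

check_hdfs_daemons
```

If no daemons are listed, the Hadoop log files are usually the first place to look.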
We still need to have everything working with Clojure, however. For this, we just use the
same dependencies and references as we did in the Initializing Cascalog and Hadoop for
distributed processing recipe. However, this time, don't worry about the REPL. We'll take
care of that separately.
For data, we'll use a dataset of the U.S. domestic flights from 1990-2009. You can
download this dataset yourself from Infochimps at
http://www.ericrochester.com/clj-data-analysis/data/flights_with_colnames.csv.gz.
I've unzipped it into the data directory.
How to do it…
For this recipe, we'll insert a file into the distributed file system, run the Clojure REPL inside
Hadoop, and read the data back out.
1. First, the data file must be in HDFS. We'll use the
data/16285/flights_with_colnames.csv file. We can insert it into HDFS with this command:
$ hadoop fs -put \
    data/16285/flights_with_colnames.csv \
    flights_with_colnames.csv
2. Now, in order to run our code in the Hadoop environment, we have to use the hadoop
command on a JAR file created from our project. Create an empty namespace to
give the JAR file a little content. For example, I created a file named
src/distrib_data/cascalog_setup.clj with this content:
(ns distrib-data.cascalog-setup
  (:require [cascalog.logic.ops :as c]
            [clojure.string :as string])
  (:use cascalog.api))
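Before running anything against the data, it can be worth confirming that the upload from step 1 succeeded. This is a rough sketch rather than part of the original recipe: it assumes the hadoop command is on your PATH and the HDFS daemons are running, and the function name check_hdfs_file is made up for illustration.

```shell
# Hypothetical helper: confirm that a file exists in HDFS and preview it.
# A relative HDFS path resolves to the current user's HDFS home directory.
check_hdfs_file() {
  if command -v hadoop >/dev/null 2>&1; then
    if hadoop fs -ls "$1"; then
      # Show the first few lines, which should include the column names.
      hadoop fs -cat "$1" | head -n 3
    else
      echo "not found in HDFS: $1"
    fi
  else
    echo "hadoop not found on PATH"
  fi
}

check_hdfs_file flights_with_colnames.csv
```

The listing should show the file with a non-zero size, followed by the header row and the first data rows.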
 