How it works…
There are several moving parts to this recipe. The primary one is Hadoop. It has its own
configuration, environment variables, and libraries that the process executing the Cascalog
queries must access. The easiest way to manage this is to run everything through the hadoop
command, so that's what we did.
The hadoop command has an fs task, which provides access to a whole range of operations
for working with HDFS. In this case, we used its -put option to copy a data file into HDFS.
Once there, we can refer to the file using the hdfs: URI scheme, which Hadoop knows how to
resolve.
In the Cascalog query, hfs-textline reads the file line by line. We use the :> operator to
bind each line to the ?line name, which is returned as the output of the query.
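The query described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's exact code: the print-lines function name, the local file name, and the HDFS path are all placeholders, and the code assumes it is run through the hadoop command so that Hadoop's configuration and libraries are available.

```clojure
;; Assumed shell step, run beforehand to copy a local file into HDFS
;; (the file name is a placeholder):
;;   hadoop fs -put data/example.csv example.csv
(require '[cascalog.api :refer [?<- stdout hfs-textline]])

(defn print-lines
  "Hypothetical helper: runs a query that reads the text file at `path`
  line by line and writes each line to standard output."
  [path]
  (let [lines (hfs-textline path)]   ; a tap that emits one tuple per line
    (?<- (stdout)                    ; sink: print the results
         [?line]                     ; output variables of the query
         (lines :> ?line))))         ; bind each line to ?line

(comment
  ;; Placeholder path; substitute the location that -put wrote to.
  (print-lines "hdfs:///user/example/example.csv"))
```

Wrapping the query in a function keeps the path out of the query itself, so the same code can be pointed at a local file while testing and at an hdfs: URI on a cluster.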
Parsing CSV files with Cascalog
In the previous recipe, the file we read was a CSV file, but we read it line by line. That's
not optimal, because each line still has to be split into fields. Cascading provides a number
of taps (sources of data or sinks to send data to), including one for CSV and other delimited
data formats. Cascalog has good wrappers for several of these taps, but not for the CSV one.
In truth, creating a wrapper that exposes all the functionality of the delimited text format tap
would be complex. There are options for delimiter characters, quote characters, including a
header row, the types of columns, and other things. That's a lot of options, and dispatching to
the right method can be tricky.
We won't worry about how to handle all the options right here. For this recipe, we will create
a simple wrapper around the delimited text file tap that includes some of the more common
options to read CSV files.
Getting ready
First, we'll need the same dependencies that we've been using, along with some new ones.
Here is the full set of dependencies that we'll need in our project.clj file:
(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev
             {:dependencies
              [[org.apache.hadoop/hadoop-core "1.2.1"]]}})
 