Distributed Data Processing with Cascalog - Clojure Data Analysis

How it works…

This function takes a number of options, such as fields, has-header, delim, and quote-str. The defaults are for CSV files, but they can be easily overridden for a variety of other formats. We saw the use of the :has-header option in the previous example. With the options in hand, it creates a TextDelimited scheme object and finally passes it to the hfs-tap function, which wraps the scheme object in a tap. The tap serves as a data generator, and we bind the values from it to the names in our query.
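
To make the option handling concrete, here is a minimal sketch in plain Clojure of how CSV defaults can be overridden by keyword options. The name delimited-options and the default values shown (a comma delimiter, a double-quote quote string, and no header) are assumptions made for illustration; they are not taken from the recipe's code. In the recipe itself, the resolved options are what get handed to the TextDelimited constructor before the scheme is wrapped by hfs-tap.

```clojure
;; Hypothetical defaults for a CSV-style delimited file (assumed values).
(def csv-defaults
  {:fields     :all
   :has-header false
   :delim      ","
   :quote-str  "\""})

(defn delimited-options
  "Merges caller-supplied keyword options over the CSV defaults.
  Hypothetical helper, for illustration only."
  [& {:as overrides}]
  (merge csv-defaults overrides))

;; With no overrides, we get the CSV settings.
(:delim (delimited-options))                                ;=> ","

;; Overriding two options describes a tab-separated file with a header row.
(:delim (delimited-options :delim "\t" :has-header true))   ;=> "\t"
```

Keeping the defaults in one map means that a caller only has to name the options that differ from plain CSV.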

There's more…

Hadoop can consume a number of different file formats. Avro (http://avro.apache.org/) uses JSON schemas to store data in a fast, compact binary format. Sequence files (http://wiki.apache.org/hadoop/SequenceFile) contain a binary key-value store. XML and JSON are also common data formats.

If we want to parse our own data formats in Cascading or Cascalog, we'll need to write our own source tap (http://docs.cascading.org/cascading/2.5/userguide/html/ch03s05.html). If it's a delimited text format, such as CSV or TSV, we can base the new tap on cascading.scheme.hadoop.TextDelimited, just as we did in this recipe. See the JavaDocs for this class at http://docs.cascading.org/cascading/2.5/cascading-hadoop/cascading/scheme/hadoop/TextDelimited.html for more information on this.
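
As a rough sketch of basing a tap on TextDelimited for a tab-separated file, we could build the scheme through Java interop and wrap it in a Cascading Hfs tap directly. This assumes that Cascading 2.x and Hadoop are on the classpath, and the function name hfs-tsv and the sample path are hypothetical. The recipe's own function goes through Cascalog's hfs-tap instead, so treat this as an illustration rather than as the recipe's code.

```clojure
(import '[cascading.scheme.hadoop TextDelimited]
        '[cascading.tap.hadoop Hfs]
        '[cascading.tuple Fields])

(defn hfs-tsv
  "Returns a Cascading tap for a tab-separated file that has a header row.
  Hypothetical helper, for illustration only."
  [path]
  ;; Constructor arguments: fields, has-header, delimiter, quote string.
  (let [scheme (TextDelimited. Fields/ALL true "\t" "\"")]
    (Hfs. scheme path)))

;; The path here is a made-up example.
(def companions-tap (hfs-tsv "data/companions.tsv"))
```

Only the delimiter changes compared to a CSV scheme; the header and quoting arguments work the same way.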

Executing complex queries with Cascalog

So far, we've seen basic Cascalog predicates and queries. We saw queries that pull data from one source generator and maybe include one predicate test. In this recipe, we'll see several more complex queries.

Getting ready

For this recipe, we'll need the same project.clj file and dependencies from the Initializing Cascalog and Hadoop for distributed processing recipe. We'll also use the Doctor Who companion data that we defined in that recipe. The source code for this data is available in the code for the topic, and you can also download just the code from to create this dataset.
