How it works…
This function takes a number of options, such as fields, has-header, delim, and quote-str. The defaults are for CSV files, but they can easily be overridden for a variety of other formats. We saw the use of the :has-header option in the previous example.
With the options in hand, it creates a TextDelimited scheme object. Finally, it passes the scheme to the hfs-tap function, which wraps the scheme object in a tap. The tap serves as a data generator, and we bind the values from it to the names in our query.
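As a rough illustration of the steps just described, a function of this shape might look like the following sketch. It assumes Cascalog 1.x, where hfs-tap lives in the cascalog.tap namespace (later versions move it), and Cascading 2.x on the classpath. The namespace name example.delimited and the default values shown are assumptions made for illustration, not necessarily the recipe's exact code.

```clojure
(ns example.delimited
  (:require [cascalog.tap :as tap])
  (:import [cascading.tuple Fields]
           [cascading.scheme.hadoop TextDelimited]))

(defn hfs-text-delim
  "Builds a Cascalog tap for the delimited text file at path.
  With no options, the file is read as plain CSV."
  [path & {:keys [fields has-header delim quote-str]
           :or {fields Fields/ALL   ; keep every column
                has-header false    ; treat the first line as data
                delim ","           ; CSV separator
                quote-str "\""}}]   ; CSV quote character
  ;; Step 1: describe the file layout with a TextDelimited scheme.
  (let [scheme (TextDelimited. fields (boolean has-header) delim quote-str)]
    ;; Step 2: hfs-tap wraps the scheme in a tap that a query can use
    ;; as a generator.
    (tap/hfs-tap scheme path)))
```

Under these assumptions, a tab-separated file with a header row could be read by overriding two of the defaults, for example (hfs-text-delim "data/example.tsv" :delim "\t" :has-header true), where the path is hypothetical.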
There's more
uses JSON schemas to store data in a fast, compact, binary data format. Sequence files, XML, and JSON are also common data formats.
If we want to parse our own data formats in Cascading or Cascalog, we'll need to write our own source tap (http://docs.cascading.org/cascading/2.5/userguide/html/ch03s05.html). If it's a delimited text format, such as CSV or TSV, we can base the new tap on cascading.scheme.hadoop.TextDelimited, just as we did in this recipe. See the JavaDocs for this class at http://docs.cascading.org/cascading/2.5/cascading-hadoop/cascading/scheme/hadoop/TextDelimited.html for more information on this.
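For a delimited format other than CSV, one possible starting point is to build the TextDelimited scheme directly through Java interop and wrap it in Cascading's own Hfs tap. The sketch below assumes Cascading 2.x. The namespace example.tsv-tap, the function name tsv-tap, the file path, and the field names are all hypothetical and only serve the illustration.

```clojure
(ns example.tsv-tap
  (:import [cascading.tuple Fields]
           [cascading.scheme.hadoop TextDelimited]
           [cascading.tap.hadoop Hfs]))

(defn tsv-tap
  "Returns a Cascading Hfs tap that reads tab-separated lines from path
  and exposes each column under the given field names."
  [path field-names]
  (let [;; Fields takes a Comparable varargs array, so build one explicitly.
        fields (Fields. (into-array Comparable field-names))
        ;; No header line to skip, tab as the delimiter, and a double
        ;; quote as the quote character.
        scheme (TextDelimited. fields false "\t" "\"")]
    (Hfs. scheme path)))

;; Hypothetical two-column file of companion names and actors.
(def companion-tap
  (tsv-tap "data/companions.tsv" ["name" "actor"]))
```

Constructing the tap does not read the file. The data is only touched when a flow or query actually uses the tap as a source.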
Executing complex queries with Cascalog
So far, we've seen basic Cascalog predicates and queries. We saw queries that pull data from
one source generator and maybe include one predicate test. In this recipe, we'll see several
more complex queries.
Getting ready
For this recipe, we'll need the same project.clj file and dependencies from the Initializing Cascalog and Hadoop for distributed processing recipe. We'll also use the Doctor Who companion data that we defined in that recipe. The source code for this
data is available in the code for the topic, and you can also download just the code from
to create this dataset.