How it works…
This function takes a number of options, such as fields, has-header, delim, and
quote-str. The defaults are for CSV files, but they can easily be overridden for a variety
of other formats. We saw the use of the :has-header option in the previous example.
With the options in hand, it creates a TextDelimited scheme object and finally passes it
to the hfs-tap function, which wraps the scheme object in a tap. The tap serves as a data
generator, and we bind the values from it to the names in our query.
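The flow described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual source: the name delimited-tap, the namespace, and the default values are assumptions made for the example. It also assumes that Cascalog and Cascading 2.x are on the classpath, that hfs-tap is reachable through cascalog.api, and that TextDelimited has a (Fields, boolean, String, String) constructor.

```clojure
(ns example.delimited
  (:require [cascalog.api :refer [hfs-tap]])
  (:import [cascading.scheme.hadoop TextDelimited]
           [cascading.tuple Fields]))

(defn delimited-tap
  "Hypothetical helper: builds a tap over a delimited text file at `path`.
  The defaults below are assumed CSV settings and can be overridden
  with keyword options."
  [path & {:keys [fields has-header delim quote-str]
           :or   {fields     Fields/ALL
                  has-header false
                  delim      ","
                  quote-str  "\""}}]
  ;; Step 1: turn the options into a TextDelimited scheme object.
  (let [scheme (TextDelimited. ^Fields fields
                               (boolean has-header)
                               ^String delim
                               ^String quote-str)]
    ;; Step 2: wrap the scheme in a tap that points at the given path.
    (hfs-tap scheme path)))
```

Under these assumptions, a tab-separated file with a header row could be read with (delimited-tap "data/example.tsv" :has-header true :delim "\t"), where the path is a placeholder. The resulting tap can then be used as a generator in a query, with each column bound to a query variable.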
There's more…
Hadoop can consume a number of different file formats. Avro (http://avro.apache.org/)
uses JSON schemas to store data in a fast, compact binary format. Sequence files
(http://wiki.apache.org/hadoop/SequenceFile) store binary key-value pairs.
XML and JSON are also common data formats.
If we want to parse our own data formats in Cascading or Cascalog, we'll need to write our
own source tap (http://docs.cascading.org/cascading/2.5/userguide/html/ch03s05.html).
If it's a delimited text format, such as CSV or TSV, we can base the new tap on the
cascading.scheme.hadoop.TextDelimited scheme, just as we did in this recipe. See the
JavaDocs for this class at
http://docs.cascading.org/cascading/2.5/cascading-hadoop/cascading/scheme/hadoop/TextDelimited.html
for more information.
Executing complex queries with Cascalog
So far, we've seen basic Cascalog predicates and queries. We saw queries that pull data from
one source generator and maybe include one predicate test. In this recipe, we'll see several
more complex queries.
Getting ready
For this recipe, we'll need the same project.clj file and dependencies from the
Initializing Cascalog and Hadoop for distributed processing recipe. We'll also use the
Doctor Who companion data that we defined in that recipe. The source code for this
data is included in the code for the book, and you can also download just the code
that creates this dataset from
http://www.ericrochester.com/clj-data-analysis/data/companions.clj.
 