Initializing Cascalog and Hadoop for distributed processing
Much of Hadoop's early development took place at Yahoo!, where it was built as an open
source implementation of Google's MapReduce programming model. Since then, it has become
one of the most widely tested and used systems for distributed data processing.
Hadoop itself sits at the center of a larger ecosystem that includes a range of other
tools, such as the Hadoop Distributed File System (HDFS) and Pig, a language for writing
jobs that run on Hadoop.
One tool that makes working with Hadoop easier is Cascading. It provides a
workflow-like layer on top of Hadoop that makes some data processing and analysis
tasks much easier to express. Cascalog is a Clojure-idiomatic interface to Cascading and,
ultimately, to Hadoop.
This recipe will show you how to access and query data in Clojure sequences using Cascalog.
Getting ready
First, we have to list our dependencies in the Leiningen project.clj file:
(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev
             {:dependencies
              [[org.apache.hadoop/hadoop-core "1.1.1"]]}})
Finally, we'll require the namespaces that we'll use, including the clojure.string library:
(require '[clojure.string :as string])
(require '[cascalog.logic.ops :as c])
(use 'cascalog.api)
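To check that the setup works, a small query over an in-memory sequence can be run from the REPL. The following is only a minimal sketch, not the recipe's own code: the people data and the over-30 name are hypothetical, made up for illustration, and it assumes the dependencies listed above are on the classpath so that Cascalog can run Hadoop in local mode.

```clojure
(use 'cascalog.api)

;; Hypothetical sample data: each inner vector is one tuple of [name age].
(def people [["alice" 28] ["bob" 33] ["carol" 41]])

;; ??<- defines a query, runs it, and returns the results as a Clojure
;; sequence. The first predicate binds each tuple's fields to ?name and
;; ?age; the second keeps only the tuples whose ?age is greater than 30.
(def over-30
  (??<- [?name ?age]
        (people ?name ?age)
        (> ?age 30)))
```

The order of the returned tuples should not be relied on, so it is safer to compare the results as a set when checking them.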