to the fact that it essentially solved the fundamental management problem
of large-scale batch environments: binding computation to the data.
In large-scale ETL pipelines, the individual operations to be performed on
the data are usually fairly trivial. In database terms, they can usually be
boiled down to simple aggregation queries: a GROUP BY over a few keys, a
WHERE clause to filter, and SUM or perhaps DISTINCT aggregates in the
SELECT clause. The
larger issue is managing the flow of data around a cluster of machines in a
way that allows for efficient processing.
Ingesting Data from Kafka
Kafka comes with a very simple ingestion mechanism for Hadoop.
Unfortunately, it's a bit too simple to use in a production environment
with any confidence. A better choice is the Camus tool, which LinkedIn
uses for its own Kafka-to-Hadoop ingestion. Although it's currently used in
production, Camus's life as an open-source project is still fairly new, so
there is no prepackaged library available at the time of writing.
By default, Camus assumes that the data to be processed looks a lot like
LinkedIn's own internal data format, which is Avro-based. Most people do
not work at LinkedIn, so this typically requires building a custom importer
that understands what to do with the data.
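In practice, "understands what to do with the data" means implementing a
message decoder that Camus calls for every message it pulls from a topic.
The sketch below is a minimal example that treats each message as a UTF-8
string. It assumes the MessageDecoder and CamusWrapper classes from the
camus-api module as they appear on the camus-kafka-0.8 branch, so treat the
exact signatures as an assumption rather than a guaranteed contract; the
class and package names are made up for illustration.

package com.example.camus;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

// Minimal decoder sketch: hands each raw Kafka message to Camus as a
// UTF-8 string, stamped with the wall-clock time at ingestion. A real
// importer would parse the payload and extract the event's own timestamp.
public class PlainTextMessageDecoder extends MessageDecoder<byte[], String> {

    @Override
    public void init(Properties props, String topicName) {
        // Configuration from the Camus job properties arrives here.
        this.props = props;
        this.topicName = topicName;
    }

    @Override
    public CamusWrapper<String> decode(byte[] payload) {
        String record = new String(payload, StandardCharsets.UTF_8);
        return new CamusWrapper<String>(record, System.currentTimeMillis());
    }
}

Camus is then pointed at this class through its job configuration; the
checkout and build steps below provide the camus-api dependency needed to
compile it.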
To get started building such an importer, first check out the Camus
repository from GitHub so the required dependencies can be installed into
a local Maven repository. The examples here use Kafka 0.8, whose support
has not yet been merged into Camus's master branch:
$ git clone https://github.com/linkedin/camus
Cloning into 'camus'...
remote: Counting objects: 2539, done.
remote: Compressing objects: 100% (991/991), done.
remote: Total 2539 (delta 774), reused 2393 (delta 661)
Receiving objects: 100% (2539/2539), 37.99 MiB | 950.00 KiB/s, done.
Resolving deltas: 100% (774/774), done.
Checking connectivity... done
$ cd camus/
$ git checkout camus-kafka-0.8
Branch camus-kafka-0.8 set up to track remote branch
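With the Kafka 0.8 branch checked out, the Camus modules can be built and
installed into the local Maven repository so the custom decoder can depend
on them. A minimal sketch of that step, assuming a standard Maven
installation (skipping the test suite is optional, but it shortens the
build):

$ mvn clean install -DskipTests

Once the install finishes, the camus-api (and, if needed, camus-etl-kafka)
artifacts are available to the decoder's own Maven project as ordinary
dependencies.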