Taps, Schemes, and Flows
In many of the previous diagrams, there are references to “sources” and “sinks.” In Cascading, all data is read from or written to Tap instances, but is converted to and from tuple instances via Scheme objects:

Tap
    A Tap is responsible for the “how” and “where” parts of accessing data. For example, is the data on HDFS or the local filesystem? In Amazon S3 or over HTTP?

Scheme
    A Scheme is responsible for reading raw data and converting it to a tuple and/or writing a tuple out into raw data, where this “raw” data can be lines of text, Hadoop binary sequence files, or some proprietary format.
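The division of labor can be sketched as follows. This is a hedged sketch against the Cascading 2.x API (class names Hfs, FileTap, and TextLine as in its javadocs; the file paths are placeholders), not runnable without the Cascading jars:

```java
// A Scheme says HOW bytes become tuples: here, each line of text
// becomes a one-field tuple named "line".
Scheme sourceScheme = new TextLine(new Fields("line"));

// A Tap says WHERE the data lives; the same Scheme can be paired
// with different locations.
Tap hdfsTap  = new Hfs(sourceScheme, "hdfs://namenode/logs/input.txt");
Tap localTap = new FileTap(sourceScheme, "data/input.txt");
```

The point is the separation: swapping HDFS for the local filesystem changes only the Tap, while the parsing logic in the Scheme is untouched.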
Note that Taps are not part of a pipe assembly, and so they are not a type of Pipe. But they are connected with pipe assemblies when they are made cluster executable. When a pipe assembly is connected with the necessary number of source and sink Tap instances, we get a Flow. The Taps either emit or capture the field names the pipe assembly expects. That is, if a Tap emits a tuple with the field name “line” (by reading data from a file on HDFS), the head of the pipe assembly must be expecting a “line” value as well. Otherwise, the process that connects the pipe assembly with the Taps will immediately fail with an error.
So pipe assemblies are really data process definitions, and are not “executable” on their own. They must be connected to source and sink Tap instances before they can run on a cluster. This separation between Taps and pipe assemblies is part of what makes Cascading so powerful.
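The connection step looks roughly like this. Again a hedged sketch against the Cascading 2.x API (HadoopFlowConnector, SinkMode, and the pipe name "wordcount" are illustrative; the paths are placeholders):

```java
// A pipe assembly is just a definition; nothing runs yet.
Pipe assembly = new Pipe("wordcount");
// ... Each/GroupBy/Every operations would be chained onto it here ...

Tap source = new Hfs(new TextLine(new Fields("line")), "input/path");
Tap sink   = new Hfs(new TextDelimited(), "output/path", SinkMode.REPLACE);

// connect() binds the assembly to its Taps and plans the Flow.
// This is also where the field-name check happens: if the source Tap's
// fields don't match what the assembly expects, it fails here, not mid-job.
Flow flow = new HadoopFlowConnector(new Properties())
                .connect(source, sink, assembly);
flow.complete();  // run the Flow on the cluster
```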
If you think of a pipe assembly like a Java class, then a Flow is like a Java object instance (Figure 24-7). That is, the same pipe assembly can be “instantiated” many times into new Flows, in the same application, without fear of any interference between them. This allows pipe assemblies to be created and shared like standard Java libraries.
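The class/instance analogy can be made concrete without any Cascading code at all. In this plain-Java illustration (the Counter class is purely hypothetical), the class definition plays the role of the pipe assembly and each instance plays the role of a Flow:

```java
public class FlowAnalogy {
    // The class body is a definition, like a pipe assembly: it describes
    // behavior but does nothing until instantiated.
    static final class Counter {
        private int count = 0;          // per-instance state
        int increment() { return ++count; }
    }

    public static void main(String[] args) {
        // Two "instantiations" of the same definition, like the same
        // assembly planned into two Flows in one application.
        Counter a = new Counter();
        Counter b = new Counter();
        a.increment();
        a.increment();
        b.increment();
        // Each instance advances independently; neither interferes
        // with the other.
        System.out.println("a=" + a.increment() + " b=" + b.increment());
    }
}
```

Running one instance never disturbs the other's state, which is exactly the guarantee Cascading gives for Flows built from a shared pipe assembly.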