Taps, Schemes, and Flows
In many of the previous diagrams, there are references to "sources" and "sinks." In Cascading, all data is read from or written to Tap instances, but is converted to and from tuple instances via Scheme objects:
Tap
A Tap is responsible for the "how" and "where" parts of accessing data. For example, is the data on HDFS or the local filesystem? In Amazon S3 or over HTTP?
Scheme
A Scheme is responsible for reading raw data and converting it to a tuple, and/or writing a tuple out as raw data, where this "raw" data can be lines of text, Hadoop binary sequence files, or some proprietary format.
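As a sketch, a Tap and a Scheme might be paired like this. This is a minimal example assuming Cascading 2.x package names (earlier versions used cascading.scheme.TextLine and cascading.tap.Hfs); the paths are placeholders:

```java
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class TapSchemeSketch {
    public static void main(String[] args) {
        // The Scheme says how to interpret the raw bytes: TextLine turns each
        // line of text into a tuple with the fields "offset" and "line".
        TextLine scheme = new TextLine(new Fields("offset", "line"));

        // The Tap says where the data lives and how to reach it: Hfs reads
        // from Hadoop-accessible storage (HDFS, S3, or the local filesystem,
        // depending on the path and configuration).
        Tap source = new Hfs(scheme, "input/path");          // placeholder path
        Tap sink = new Hfs(new TextLine(), "output/path");   // placeholder path
    }
}
```

The same Scheme can be reused with different Tap types, which is the point of the separation: "how to parse" is independent of "where the data lives."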
Note that Taps are not part of a pipe assembly, and so they are not a type of Pipe. But they are connected with pipe assemblies when they are made cluster executable. When a pipe assembly is connected with the necessary number of source and sink Tap instances, we get a Flow. The Taps either emit or capture the field names the pipe assembly expects. That is, if a Tap emits a tuple with the field name "line" (by reading data from a file on HDFS), the head of the pipe assembly must be expecting a "line" value as well. Otherwise, the process that connects the pipe assembly with the Taps will immediately fail with an error.
So pipe assemblies are really data process definitions, and are not "executable" on their own. They must be connected to source and sink Tap instances before they can run on a cluster. This separation between Taps and pipe assemblies is part of what makes Cascading so powerful.
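The connection step is a single call on a flow connector. A hedged sketch, assuming Cascading 2.x's HadoopFlowConnector (the pipe and paths are illustrative stand-ins):

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ConnectSketch {
    public static void main(String[] args) {
        // The source Tap emits tuples with "offset" and "line" fields; the
        // head of the pipe assembly must expect those fields.
        Tap source = new Hfs(new TextLine(new Fields("offset", "line")), "input/path");
        Tap sink = new Hfs(new TextLine(), "output/path");

        Pipe assembly = new Pipe("example"); // stand-in for a real pipe assembly

        // connect() plans the assembly against the taps and returns a Flow;
        // field-name mismatches between taps and pipes fail here, at planning
        // time, rather than partway through a cluster job.
        Flow flow = new HadoopFlowConnector().connect(source, sink, assembly);
        flow.complete(); // run the Flow and block until it finishes
    }
}
```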
If you think of a pipe assembly like a Java class, then a Flow is like a Java object instance (Figure 24-7). That is, the same pipe assembly can be "instantiated" many times into new Flows, in the same application, without fear of any interference between them. This allows pipe assemblies to be created and shared like standard Java libraries.
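That reuse can be sketched by connecting the same assembly twice against different taps, each connect() call yielding an independent Flow (again assuming Cascading 2.x; paths are placeholders):

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ReuseSketch {
    public static void main(String[] args) {
        // One pipe assembly definition, shared like a library class.
        Pipe assembly = new Pipe("shared-assembly");

        HadoopFlowConnector connector = new HadoopFlowConnector();

        // "Instantiate" the definition twice against different source and
        // sink taps: each connect() call plans a separate, independent Flow.
        Flow flowA = connector.connect(
            new Hfs(new TextLine(new Fields("offset", "line")), "inputA"),
            new Hfs(new TextLine(), "outputA"), assembly);
        Flow flowB = connector.connect(
            new Hfs(new TextLine(new Fields("offset", "line")), "inputB"),
            new Hfs(new TextLine(), "outputB"), assembly);

        // The two Flows run without interfering with each other.
        flowA.complete();
        flowB.complete();
    }
}
```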