The Wikipedia Edit Stream
Wikipedia makes its edit stream available through an Internet Relay
Chat (IRC) channel for anyone to consume. A fairly high volume of edits
is being performed on Wikipedia at any given moment, making it a good
source of “test” data when learning to use real-time streaming
applications. This data stream is used as a working example throughout
this chapter.
To get the data, an application is needed to watch the IRC data stream.
Fortunately, the Samza project from Chapter 5, “Processing Streaming
Data,” includes a Wikipedia IRC reader that streams the edits into a
Kafka queue. The reader is included as part of the Hello Samza project.
To get this running, first check out the introductory code from GitHub
and start the included Samza grid:
$ git clone https://github.com/apache/incubator-samza-hello-samza.git
$ cd incubator-samza-hello-samza/
$ ./bin/grid bootstrap
... Output Removed ...
EXECUTING: start zookeeper
JMX enabled by default
Using config: /Users/bellis/Projects/incubator-samza-hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
EXECUTING: start kafka
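To make the idea of "watching the IRC data stream" more concrete, the following is a minimal Python sketch of such a watcher. It is not the reader that ships with Hello Samza, and it does not write to Kafka; it only prints incoming edit notices. The host name, port, channel, and nickname are assumptions that do not appear in the text above, and the helper names parse_privmsg and watch are hypothetical.

```python
import socket

# Assumed values, not taken from the text: Wikimedia's public IRC feed
# host and port, and the channel carrying English Wikipedia edits.
HOST = "irc.wikimedia.org"
PORT = 6667
CHANNEL = "#en.wikipedia"


def parse_privmsg(line):
    """Return (channel, text) for an IRC PRIVMSG line, or None otherwise."""
    # A channel message looks like ":prefix PRIVMSG #channel :message text".
    if not line.startswith(":"):
        return None
    parts = line.split(" ", 3)
    if len(parts) < 4 or parts[1] != "PRIVMSG":
        return None
    text = parts[3]
    if text.startswith(":"):
        text = text[1:]
    return (parts[2], text)


def watch(nick="editdemo", limit=10):
    """Connect, join the channel, and print the first `limit` edit notices."""
    with socket.create_connection((HOST, PORT)) as sock:
        # Register with the server; the channel is joined after the welcome.
        sock.sendall(f"NICK {nick}\r\nUSER {nick} 0 * :{nick}\r\n".encode())
        buffer = b""
        seen = 0
        while seen < limit:
            data = sock.recv(4096)
            if not data:
                break  # server closed the connection
            buffer += data
            # Keep any trailing partial line in the buffer for the next read.
            *lines, buffer = buffer.split(b"\r\n")
            for raw in lines:
                line = raw.decode("utf-8", errors="replace")
                fields = line.split(" ")
                if line.startswith("PING"):
                    # Answer keepalives so the server does not drop us.
                    sock.sendall(("PONG" + line[4:] + "\r\n").encode())
                elif len(fields) > 1 and fields[1] == "001":
                    # Numeric 001 is the welcome reply; now it is safe to join.
                    sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
                else:
                    parsed = parse_privmsg(line)
                    if parsed:
                        print(parsed[1])
                        seen += 1
```

Calling watch() requires network access and prints raw edit notices as they arrive. In the Hello Samza setup, the equivalent step hands each message to Kafka instead of printing it.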
Samza is a fairly young project and the code base is moving quickly. As
such, it is possible that the public repository has changed in such a way
that the examples in this topic no longer work. If this is the case, the
code included with this topic also includes a copy of both the
incubator-samza and hello-samza projects at the time of writing.
To use them, simply unpack the archive and copy the incubator-samza
project into the samza download directory. On most Unix-like