Advance Concepts in Storm - Real-time Analytics with Storm and Cassandra

Examples and illustrations

Another popular out-of-the-box Trident example is the reach topology, a pure DRPC topology that computes the reach of a URL on demand. Let's first understand some of the jargon before we delve deeper.

Reach is the total number of unique Twitter users exposed to a URL.

Reach computation is a multistep process that can be carried out in the following steps:

• Get all the users who have ever tweeted the URL

• Fetch the followers of each of these users

• Merge the follower sets fetched in the previous step, which can be huge, into a single set

• Count the members of that set
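The four steps above can be sketched on a single machine in plain Java. This is only an illustration under stated assumptions: the class name ReachSketch, the reach method, and the two in-memory maps standing in for the lookups are hypothetical and are not part of Storm or of the book's code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReachSketch {
    // Hypothetical helper: computes reach from two in-memory lookup tables.
    static int reach(String url,
                     Map<String, List<String>> urlToTweeters,
                     Map<String, List<String>> tweeterToFollowers) {
        // Step 1: users who tweeted the URL (empty if the URL is unknown).
        List<String> tweeters =
            urlToTweeters.getOrDefault(url, Collections.emptyList());

        // Steps 2 and 3: fetch each tweeter's followers and merge them;
        // the set drops followers that appear more than once.
        Set<String> exposed = new HashSet<>();
        for (String tweeter : tweeters) {
            exposed.addAll(
                tweeterToFollowers.getOrDefault(tweeter, Collections.emptyList()));
        }

        // Step 4: count the set.
        return exposed.size();
    }
}
```

Using a set for the merge step matters: a user who follows several of the tweeters should be counted only once.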

Looking at this skeletal algorithm, you can see that it is beyond the capability of a single machine and that we need a distributed compute engine to carry it out. It is an ideal candidate for the Storm Trident framework, which lets you execute highly parallel computations at each step across the cluster.
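To see why the counting step lends itself to parallel execution, here is a minimal plain-Java sketch of one possible approach: route followers to partitions by hash, count distinct followers per partition, and add up the partial counts. The class PartitionedCountSketch and its method are hypothetical, and this is not Storm code or necessarily how the topology itself distributes the work.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PartitionedCountSketch {
    // Hypothetical helper: distinct count computed from independent partitions,
    // the way a cluster could split the work across workers.
    static long countDistinct(List<String> followers, int partitions) {
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new HashSet<>());
        }

        // Route each follower by hash, so every copy of the same follower
        // lands in the same partition. (The routing loop itself is sequential
        // in this sketch.)
        for (String follower : followers) {
            int index = Math.floorMod(follower.hashCode(), partitions);
            buckets.get(index).add(follower);
        }

        // Each partition's size is its own distinct count. No follower sits in
        // two partitions, so the partial counts sum to the global count.
        // Only this summation runs on a parallel stream here.
        return buckets.parallelStream().mapToLong(Set::size).sum();
    }
}
```

Because duplicates of a follower always meet in the same partition, no coordination between partitions is needed until the final sum.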

Our Trident reach topology reads data from two large data banks:

• Bank A is the URL-to-originator bank, which stores each URL along with the names of the users who tweeted it

• Bank B is the user-to-follower bank, which holds the user-to-follower mapping for all Twitter users
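For local experiments, the two banks can be stood in for by in-memory maps. The following is a sketch only: the class DataBanksSketch, its field and method names, and the sample records are all made up for illustration, and a real deployment would back these lookups with a database.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DataBanksSketch {
    // Bank A stand-in: URL -> users who tweeted it (sample data is made up).
    static final Map<String, List<String>> URL_TO_TWEETERS = new HashMap<>();
    // Bank B stand-in: user -> followers of that user (sample data is made up).
    static final Map<String, List<String>> TWEETER_TO_FOLLOWERS = new HashMap<>();

    static {
        URL_TO_TWEETERS.put("example.com/post/1", Arrays.asList("sally", "bob"));
        TWEETER_TO_FOLLOWERS.put("sally", Arrays.asList("bob", "tim", "alice"));
        TWEETER_TO_FOLLOWERS.put("bob", Arrays.asList("sally", "tim"));
    }

    // Lookup against bank A; unknown URLs yield an empty list.
    static List<String> tweetersOf(String url) {
        return URL_TO_TWEETERS.getOrDefault(url, Collections.emptyList());
    }

    // Lookup against bank B; unknown users yield an empty list.
    static List<String> followersOf(String user) {
        return TWEETER_TO_FOLLOWERS.getOrDefault(user, Collections.emptyList());
    }
}
```

Both banks are keyed lookups that return a list, which is the shape the two state queries in the topology rely on: one key in, a list of users out.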

The topology would be defined as follows:

TridentState urlToTweeterState =
    topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowerState =
    topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
    .stateQuery(urlToTweeterState, new Fields("args"),
        new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle()
    .stateQuery(tweetersToFollowerState, new
