Advance Concepts in Storm - Real-time Analytics with Storm and Cassandra

Database Reference

In-Depth Information

Fields("tweeter"), new MapGet(), new Fields("followers"))

.parallelismHint(200)

.each(new Fields("followers"), new ExpandList(),

new Fields("follower"))

.groupBy(new Fields("follower"))

.aggregate(new One(), new Fields("one"))

.parallelismHint(20)

.aggregate(new Count(), new Fields("reach"));

In the preceding topology, we perform the following steps:

1. Create a TridentState object for both data banks (URL to the originator

Bank A and users to follow Bank B).

2. The newStaticState method is used for the instantiation of state objects for

data banks; we have the capability to run the DRPC queries over the source states

created earlier.

3. In execution, when the reach of a URL is to be computed, we perform a query us-

ing the Trident state for data bank A to fetch the list of all the users who have ever

tweeted with this URL.

4. The ExpandList function creates and emits one tuple for each of the tweeters

of the URL in query.

5. Next, we fetch the follower of each tweeter fetched previously. This step needs

the highest degree of parallelism, thus we use shuffle grouping here for even load

distribution across all instances of the bolt. In our reach topology, this is the most

intense compute step.

6. Once we have the list of followers of the tweeter of the URL, we execute an oper-

ation analog to filter unique followers only.

7. We arrive at unique followers by grouping them together and then using the one

aggregator. The latter simply emits 1 for each group and in the next step all these

are counted together to arrive at the reach.

8. Then we count the followers (unique) thus arriving at the reach of the URL.

Search WWH ::

Custom Search

Home