Storage Provisioning and Networking - Deploying and Managing a Cloud Infrastructure

EXERCISE 9.4 (continued)

The user can view the data inside the HDFS directory (/user/myuser/grouped) using the Hadoop file system shell (for example, hadoop fs -cat). The user can also control the reduce task parallelism by replacing the group statement in the group.pig script as follows. The map task parallelism is determined dynamically by the input file: usually there is one map task for each HDFS block (the default HDFS block size is 128 MB).

grpd = group divs by symbol parallel 12;
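
The one-map-per-block rule above can be checked with simple arithmetic. The following is a minimal Python sketch, not part of the exercise: the function name estimate_map_tasks and the example file sizes are assumptions made for illustration, and the actual number of map tasks on a cluster also depends on the input format and split settings.

```python
# Default HDFS block size cited in the text (128 MB).
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def estimate_map_tasks(file_size_bytes: int,
                       block_size_bytes: int = BLOCK_SIZE_BYTES) -> int:
    """Rough estimate: one map task per HDFS block occupied by the file.

    Hypothetical helper for illustration only; it is not a Hadoop or Pig API.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(estimate_map_tasks(one_gib))            # 1 GiB file
    print(estimate_map_tasks(300 * 1024 * 1024))  # 300 MiB file
```

Under this approximation, a 1 GiB input file occupies eight 128 MB blocks and a 300 MiB file occupies three, so the map side scales with the input size, while the reduce side is fixed at 12 by the parallel clause.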

Now assume you have another file called NYSE_daily.json, which represents today's stock trades. You would like to see which stocks from NYSE_dividends.json were also traded today. This can be accomplished by joining the two data sets, as shown in the join.pig script. The parallel option for reduce tasks can be used in Pig with any aggregation or accumulation operation, including group, cogroup, and join.

--join.pig
-- Load today's trades and the dividend records from HDFS.
daily = load 'hdfs://localhost:9000/user/myuser/NYSE_daily.json'
    using JsonLoader('date:chararray, dividends:chararray, symbol:chararray, exchange:chararray');
divs = load 'hdfs://localhost:9000/user/myuser/NYSE_dividends.json'
    using JsonLoader('date:chararray, dividends:chararray, symbol:chararray, exchange:chararray');
-- Inner join: keep only the symbols that appear in both data sets.
jnd = join daily by symbol, divs by symbol;
-- Write the joined records back to HDFS as JSON.
store jnd into 'hdfs://localhost:9000/user/myuser/joined' using JsonStorage();
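
To make the semantics of the join statement concrete, here is a minimal in-memory Python sketch of an inner join on symbol. The helper name inner_join_on_symbol and the sample records are hypothetical and are not taken from the exercise's data files. Pig runs the join as a distributed job, which this sketch does not model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]


def inner_join_on_symbol(daily: List[Record],
                         divs: List[Record]) -> List[Tuple[Record, Record]]:
    """Pair every daily record with every dividend record sharing its symbol.

    Hypothetical helper for illustration; records without a match are dropped,
    as in an inner join.
    """
    # Build a lookup table from symbol to all dividend records for that symbol.
    divs_by_symbol: Dict[str, List[Record]] = defaultdict(list)
    for div in divs:
        divs_by_symbol[div["symbol"]].append(div)

    joined: List[Tuple[Record, Record]] = []
    for trade in daily:
        # .get avoids inserting empty lists for symbols with no dividends.
        for div in divs_by_symbol.get(trade["symbol"], []):
            joined.append((trade, div))
    return joined


if __name__ == "__main__":
    # Made-up sample data, not from NYSE_daily.json or NYSE_dividends.json.
    sample_daily = [{"symbol": "AAA", "exchange": "NYSE"},
                    {"symbol": "BBB", "exchange": "NYSE"}]
    sample_divs = [{"symbol": "AAA", "dividends": "0.10"},
                   {"symbol": "AAA", "dividends": "0.12"},
                   {"symbol": "CCC", "dividends": "0.05"}]
    for trade, div in inner_join_on_symbol(sample_daily, sample_divs):
        print(trade["symbol"], div["dividends"])
```

In the sample data, only the symbol AAA appears on both sides, and it is paired once with each of its two dividend records. BBB and CCC are dropped because neither has a match in the other data set.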

Choosing from among These Technologies

An obvious problem with having a number of good technologies available is the decision-making process. Deciding which technology to use is not always trivial. In fact, such a decision could prove critical to an organization's cloud application life cycle.

The decision for a cloud storage technology should take into account three very important factors:

Nature of the data to be stored (or the type):

■ Is it corporate data, such as customer data, company data, or product data?

■ Is it log data, such as logging or machine-generated data?

■
