EXERCISE 9.4 (continued)
The user can view the data inside the HDFS directory (/user/myuser/grouped) using the
Hadoop command. Map task parallelism is determined dynamically by the input file: usually,
there is one map task for each HDFS block (the default HDFS block size is 128 MB). Reduce
task parallelism, by contrast, can be controlled by the user by replacing the group
statement in the group.pig script as follows:
grpd = group divs by symbol parallel 12;
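The one-map-per-block rule mentioned above can be turned into a quick estimate. The following is a minimal Python sketch (not Pig), assuming the 128 MB block size cited in the text; the function name estimated_map_tasks is hypothetical, and the real number of map tasks also depends on the input format and split settings of the cluster.

```python
# Assumption: the 128 MB default HDFS block size cited in the text.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def estimated_map_tasks(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Rough estimate: one map task per HDFS block, counting a partial last block."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gib = 1024 ** 3
print(estimated_map_tasks(one_gib))            # 1 GiB of input
print(estimated_map_tasks(300 * 1024 * 1024))  # 300 MiB of input
```

Under these assumptions a 1 GiB file spans 8 blocks and a 300 MiB file spans 3, so the map side scales with the input size, while the reduce side stays at whatever the parallel clause requests (12 in the statement above).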
Now let us assume you have another file called NYSE_daily.json, which represents
today's stock trades. What you would like to see is which stocks from NYSE_dividends.json
were also traded today. This can be accomplished by joining the two data sets, as shown
in the join.pig script. The parallel option for reduce tasks can be used in Pig with any
aggregation or accumulation operation, including group, cogroup, and join.
-- join.pig
-- Load today's trades.
daily = load 'hdfs://localhost:9000/user/myuser/NYSE_daily.json'
    using JsonLoader('date:chararray, dividends:chararray, symbol:chararray, exchange:chararray');
-- Load the dividend records.
divs = load 'hdfs://localhost:9000/user/myuser/NYSE_dividends.json'
    using JsonLoader('date:chararray, dividends:chararray, symbol:chararray, exchange:chararray');
-- Keep only the records whose symbol appears in both data sets.
jnd = join daily by symbol, divs by symbol;
store jnd into 'hdfs://localhost:9000/user/myuser/joined' using JsonStorage();
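To make the effect of the join statement concrete, here is a small Python sketch (not Pig) of an inner join on symbol, which is what Pig's join performs by default. The sample rows and the symbols AAA, BBB, and CCC are made up for illustration, and the inner_join helper is hypothetical; the daily:: and divs:: prefixes mimic how Pig names fields in a joined relation.

```python
from collections import defaultdict


def inner_join(daily, divs, key="symbol"):
    """Return one combined record per matching (daily, divs) pair."""
    # Index one side by the join key so lookups are cheap.
    divs_by_key = defaultdict(list)
    for row in divs:
        divs_by_key[row[key]].append(row)

    joined = []
    for d in daily:
        # Rows with no partner on the other side are dropped (inner join).
        for v in divs_by_key.get(d[key], []):
            record = {f"daily::{name}": value for name, value in d.items()}
            record.update({f"divs::{name}": value for name, value in v.items()})
            joined.append(record)
    return joined


# Made-up sample rows, not taken from the NYSE files.
daily = [
    {"symbol": "AAA", "exchange": "NYSE"},
    {"symbol": "BBB", "exchange": "NYSE"},
]
divs = [
    {"symbol": "AAA", "dividends": "0.10"},
    {"symbol": "AAA", "dividends": "0.12"},
    {"symbol": "CCC", "dividends": "0.05"},
]

jnd = inner_join(daily, divs)
matched = sorted({row["daily::symbol"] for row in jnd})
print(matched)
```

In this sketch only AAA appears in both inputs, so BBB and CCC are dropped, and AAA yields one output record per matching dividend row.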
Choosing from among These Technologies
Having a number of good technologies to choose from creates an obvious problem: deciding
which one to use. The choice is not always trivial. In fact, it could prove to be critical
in terms of an organization's cloud application life cycle.
The decision for a cloud storage technology should take into account three very
important factors:
Nature of the data to be stored (or the type):
Is it corporate data, such as customer data, company data, or product data?
Is it log data, such as logging or machine-generated data?