Introducing Big Data Technologies - Data Warehousing in the Age of Big Data


SPLIT: Split data into two or more sets, based on filter conditions.

STREAM: Send all records through a user-provided binary.

DUMP: Write output to stdout.

LIMIT: Limit the number of records.
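To make the operator descriptions above concrete, here is a minimal plain-Python sketch of what SPLIT and LIMIT do to a set of records. It is not Pig code; the helper names `split` and `limit` and the sample records are hypothetical, chosen only for illustration. As in Pig, a record may land in more than one output of a split, or in none, depending on the filter conditions.

```python
from itertools import islice

def split(records, *predicates):
    """Route each record to every output whose filter condition it satisfies."""
    outputs = [[] for _ in predicates]
    for rec in records:
        for out, pred in zip(outputs, predicates):
            if pred(rec):
                out.append(rec)
    return outputs

def limit(records, n):
    """Keep at most n records (here: the first n in iteration order)."""
    return list(islice(records, n))

# Hypothetical (name, size) records
records = [("a", 3), ("b", 12), ("c", 7), ("d", 25)]

small, large = split(records,
                     lambda r: r[1] < 10,    # first output: size below 10
                     lambda r: r[1] >= 10)   # second output: size 10 or more
first_two = limit(records, 2)
```

In Pig itself, LIMIT on an unordered relation does not promise which records are returned; the sketch simply takes the first ones it sees.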

During program execution, Pig first validates the syntax and semantics of each statement and continues to process them; when it encounters a DUMP or STORE, it executes the statements that lead up to it. For example, a Pig job to process compliance logs and extract words and phrases will look like the following:

A = load 'compliance_log';

B = foreach A generate

flatten(TOKENIZE((chararray)$0)) as word;

C = filter B by word matches '\\w+';

D = group C by word;

E = foreach D generate COUNT(C), group;

store E into 'compliance_log_freq';
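For readers who want to trace what the script above computes, the following plain-Python sketch mirrors its steps on a small in-memory sample. It is an illustration under stated assumptions, not a replacement for Pig: `word_freq` and the sample lines are hypothetical, and splitting on whitespace is a simplification of Pig's TOKENIZE, which also breaks on a few punctuation characters.

```python
import re
from collections import Counter

def word_freq(lines):
    # B: tokenize each line and flatten into one stream of tokens
    # (whitespace split is a simplification of TOKENIZE)
    words = (w for line in lines for w in line.split())
    # C: keep tokens made up entirely of word characters;
    # Pig's "matches" must match the whole string, hence fullmatch
    kept = (w for w in words if re.fullmatch(r"\w+", w))
    # D and E: group by word and count the members of each group
    return Counter(kept)

# Hypothetical log lines
sample = ["access denied for user admin",
          "access granted; audit logged"]
freq = word_freq(sample)
```

The token `granted;` is dropped by the filter because of the trailing semicolon, which is the same effect the `'\\w+'` pattern has in the Pig script.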

Now let us say that we want to analyze how many of these words are in FDA mandates:

A = load 'FDA_Data';

B = foreach A generate

flatten(TOKENIZE((chararray)$0)) as word;

C = filter B by word matches '\\w+';

D = group C by word;

E = foreach D generate COUNT(C), group;

store E into 'FDA_Data_freq';

We can then join these two outputs to create a result set:

compliance = LOAD 'compliance_log_freq' AS (freq, word);

FDA = LOAD 'FDA_Data_freq' AS (freq, word);

inboth = JOIN compliance BY word, FDA BY word;

STORE inboth INTO 'output';
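The join above keeps only the words that appear in both frequency files and carries the fields of both inputs into each output record. A minimal Python sketch of that behavior follows; the function name `join_on_word` and the sample counts are hypothetical, and each input is assumed to hold one entry per word, as the earlier grouping step produces.

```python
def join_on_word(left, right):
    """Inner join of two {word: freq} mappings on the word key.

    Each result row carries the fields of both inputs, in the order
    (left freq, left word, right freq, right word).
    """
    return sorted(
        (left_freq, word, right[word], word)
        for word, left_freq in left.items()
        if word in right
    )

# Hypothetical frequency tables
compliance = {"audit": 3, "access": 5}
fda = {"audit": 2, "label": 7}

inboth = join_on_word(compliance, fda)
```

Words that occur in only one of the inputs, such as `access` and `label` here, do not appear in the result.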

In this example, the Food and Drug Administration (FDA) data is largely semi-structured, and the compliance logs are generated by multiple applications. Processing large data sets with a few simple lines of code is what Pig brings to MapReduce and Hadoop data processing.

Though Pig is very powerful, it is not well suited to small data sets or to transactional data. Its mainstream adoption is still evolving. In the near future, I think Pig will be used more in data collection and preprocessing environments and in streaming data processing environments.

HBase

HBase is an open-source, nonrelational, column-oriented, multidimensional, distributed database modeled on Google's Bigtable architecture. It is designed with high availability and high performance as drivers to support storage and processing of large data sets on the Hadoop framework. HBase is not a database in the purist definition of a database. It offers very high scalability and performance and supports certain features of an ACID-compliant database. HBase is classified as a NoSQL database because its architecture and design are closely aligned with BASE (Basically Available, Soft state, Eventual consistency).
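One way to picture the "column-oriented, multidimensional" description is as a nested map: row key, then column (commonly written as family:qualifier), then timestamp, then value. The sketch below is a toy in-memory model of that idea only. `ToyTable`, `put`, and `get` are hypothetical names for illustration and are not the HBase client API; distribution, persistence, and sorting are all left out.

```python
from collections import defaultdict

class ToyTable:
    """Toy model: row key -> column -> timestamp -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # Each write adds a new timestamped version of the cell
        self._rows[row][column][ts] = value

    def get(self, row, column):
        # Read the most recent version of the cell, or None if absent
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

# Hypothetical row key, column name, and values
table = ToyTable()
table.put("log-0001", "event:status", "denied", ts=1)
table.put("log-0001", "event:status", "granted", ts=2)
latest = table.get("log-0001", "event:status")
missing = table.get("log-0002", "event:status")
```

Different rows can hold entirely different sets of columns in this model, which is part of what separates it from a fixed relational schema.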
