SPLIT: Split data into two or more sets, based on filter conditions.
STREAM: Send all records through a user-provided binary.
DUMP: Write output to stdout.
LIMIT: Limit the number of records.
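The four operators above can be combined in one short script. The following is a minimal sketch, not taken from the original text: the input file 'app_events' and its three-field schema are hypothetical, and the external program is assumed to be the standard Unix cut utility available on the cluster nodes.

```pig
-- Hypothetical input: tab-separated records of (app, severity, message).
events = LOAD 'app_events' AS (app:chararray, severity:chararray, message:chararray);

-- SPLIT: route records into two relations based on filter conditions.
-- Records whose severity is null match neither condition and are dropped.
SPLIT events INTO errors IF severity == 'ERROR', others IF severity != 'ERROR';

-- STREAM: pipe every record through an external program.
-- Records are sent tab-delimited, so `cut -f 1` keeps only the app field.
apps = STREAM errors THROUGH `cut -f 1`;

-- LIMIT: keep at most 10 records.
first_apps = LIMIT apps 10;

-- DUMP: write the result to stdout; this is also what triggers execution.
DUMP first_apps;
```

The relation others is defined but never stored or dumped, so Pig has no reason to compute it.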
During program execution, Pig first validates the syntax and semantics of each statement and
continues processing; only when it encounters a DUMP or STORE does it actually execute the
statements. For example, a Pig job to process compliance logs and count the words they contain
looks like the following:
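-- Load the log; with the default loader, $0 is the first tab-delimited field of each line.
A = load 'compliance_log';
-- Split the text into tokens, one word per record.
B = foreach A generate
    flatten(TOKENIZE((chararray)$0)) as word;
-- Keep only tokens made up entirely of word characters.
C = filter B by word matches '\\w+';
-- Group identical words and count the records in each group.
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into 'compliance_log_freq';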
Now let us say that we want to analyze how many of these words are in FDA mandates:
A = load 'FDA_Data';
B = foreach A generate
flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into 'FDA_Data_freq';
We can then join these two outputs to create a result set:
compliance = LOAD 'compliance_log_freq' AS (freq, word);
FDA = LOAD 'FDA_Data_freq' AS (freq, word);
inboth = JOIN compliance BY word, FDA BY word;
STORE inboth INTO 'output';
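A joined relation carries the fields of both inputs, so freq and word each appear twice. The sketch below is one possible way to tidy the result, under assumptions of my own: the field types, the output path 'output_ranked', and the choice to sort by the compliance-log frequency are illustrative and not part of the original example.

```pig
-- Declare types so the frequencies load as numbers rather than bytearrays.
compliance = LOAD 'compliance_log_freq' AS (freq:long, word:chararray);
FDA = LOAD 'FDA_Data_freq' AS (freq:long, word:chararray);

inboth = JOIN compliance BY word, FDA BY word;

-- After a join, same-named fields are disambiguated with the relation name and '::'.
result = FOREACH inboth GENERATE
    compliance::word AS word,
    compliance::freq AS compliance_freq,
    FDA::freq AS fda_freq;

-- Sort so the most frequent compliance-log words come first.
ranked = ORDER result BY compliance_freq DESC;
STORE ranked INTO 'output_ranked';
```

Because JOIN is an inner join by default, only words present in both frequency files survive into the result.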
In this example the Food and Drug Administration (FDA) data is semi-structured and the
compliance logs are generated by multiple applications. Processing large data sets with a few
simple lines of code is what Pig brings to MapReduce and Hadoop data processing.
Though Pig is very powerful, it is not well suited to small data sets or to transactional data.
Its mainstream adoption is still evolving. In the near future, I think Pig will be used more in data
collection and preprocessing environments and in streaming data processing environments.
HBase
HBase is an open-source, nonrelational, column-oriented, multidimensional, distributed database
modeled on Google's Bigtable architecture. It is designed with high availability and high
performance as drivers to support storage and processing of large data sets on the Hadoop framework.
HBase is not a database in the purist definition of a database. It is built to scale out across
many nodes while sustaining performance, and it supports certain features of an ACID-compliant
database. HBase is classified as a NoSQL database due to its architecture and design being closely
aligned to BASE (Basically Available, Soft state, Eventually consistent).
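To make the column-oriented model concrete, Pig can read an HBase table through the HBaseStorage loader that ships with Pig. The following is only a sketch under stated assumptions: the table 'compliance_events', its column family 'log', and the qualifiers 'app' and 'message' are hypothetical names chosen for illustration.

```pig
-- Hypothetical HBase table 'compliance_events' with one column family, 'log'.
-- Each cell is addressed as family:qualifier; '-loadKey true' returns the row key
-- as the first field of every record.
events = LOAD 'hbase://compliance_events'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'log:app log:message', '-loadKey true')
    AS (rowkey:chararray, app:chararray, message:chararray);

-- From here the rows behave like any other Pig relation.
by_app = GROUP events BY app;
counts = FOREACH by_app GENERATE group AS app, COUNT(events) AS n;
DUMP counts;
```

Only the listed columns are fetched, which reflects how a column-oriented store lets a job read a subset of columns instead of whole rows.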