Chapter 9
Data Research and Advanced Data Cleansing
with Pig and Hive
What You Will Learn in This Chapter
• Understanding the Difference Between Pig and Hive and When to Use Each
• Using Pig Latin Built-in Functions for Advanced Extracting, Transforming, and Loading of Data
• Understanding the Various Types of Hive Functions Available
• Extending Hive with Map-Reduce Scripts
• Creating Your Own Functions to Plug into Hive
All data processing on Hadoop essentially boils down to a map-reduce
process. The map phase retrieves the data and performs operations such as
filtering and sorting; the reduce phase performs a summary operation such
as grouping or counting. Hadoop map-reduce jobs are written in
programming languages such as Java and C#. Although this works well for
developers with a programming background, it presents a steep learning
curve for nonprogrammers. This is where Pig comes into play. Another tool
for creating and running map-reduce jobs in Hadoop is Hive. Like Pig,
Hive relies on a batch-based, parallel-processing paradigm and is useful
for querying, aggregating, and filtering large data sets.
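The map and reduce phases described above can be sketched in plain Python. This is a conceptual illustration of the paradigm, not actual Hadoop code: a mapper emits key-value pairs, a shuffle step groups them by key, and a reducer summarizes each group, here as a simple word count.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in the record.
    for word in line.lower().split():
        yield (word, 1)

def reducer(key, values):
    # Reduce phase: summarize all values for one key (here, a count).
    return (key, sum(values))

# Sample input records (one line per record).
lines = ["big data on hadoop", "pig and hive on hadoop"]

# Map: apply the mapper to every input record.
pairs = [pair for line in lines for pair in mapper(line)]

# Shuffle/sort: group intermediate pairs by key, as Hadoop does
# between the map and reduce phases.
pairs.sort(key=itemgetter(0))

# Reduce: aggregate each group into a final (key, count) result.
counts = dict(
    reducer(key, (v for _, v in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(counts["hadoop"])  # "hadoop" appears once in each line -> 2
```

Pig and Hive generate this same kind of map, shuffle, and reduce pipeline for you, so you describe the transformation at a higher level instead of coding each phase by hand.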
This chapter covers both Pig and Hive and will help you to understand
the strengths of each. You will also see how to extend Pig and Hive using
functions and custom map-reduce scripts. In addition, the chapter includes
hands-on activities to help you solidify the concepts presented.
Getting to Know Pig
Pig was originally developed as a research project within Yahoo! in 2006. It
became popular with the user community as a way to increase productivity
when writing map-reduce jobs. By 2007, Yahoo! decided to work with the