Pig
Pig is a high-level platform for processing Big Data on Hadoop clusters. Pig consists of a data flow
language, called Pig Latin, for writing queries over large datasets, and an execution environment
that runs Pig Latin programs from a console. A Pig Latin program is a series of dataset transformations
that are converted under the hood into a series of MapReduce jobs. Pig Latin abstractions provide richer
data structures than raw MapReduce; they do for Hadoop what SQL does for RDBMS systems.
Pig Latin is fully extensible: User Defined Functions (UDFs), written in Java, Python, C#, or JavaScript,
can be called to customize each stage of the processing path when composing an analysis.
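A minimal Pig Latin sketch of the transformation-series style described above (the file path, field names, and output directory are illustrative assumptions):

```pig
-- Load a tab-separated log file (path and schema are hypothetical)
logs   = LOAD '/data/weblogs.tsv' AS (ip:chararray, url:chararray, bytes:int);
-- Group the records by URL
by_url = GROUP logs BY url;
-- Count hits and sum bytes per URL
stats  = FOREACH by_url GENERATE group AS url,
                                 COUNT(logs) AS hits,
                                 SUM(logs.bytes) AS total_bytes;
-- Write the result back into HDFS
STORE stats INTO '/output/url_stats';
```

When run, Pig compiles this series of statements into the underlying MapReduce job series; the author never writes map or reduce functions directly.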
Hive
Hive is the glue between the world of Hadoop and the world of BI: in effect, it can make Hadoop
look like just another relational data source. Hive is aimed at analysts with strong SQL skills, providing
an SQL-like interface and a relational data model. Hive uses a language called HiveQL, a dialect of SQL.
Like Pig, Hive is an abstraction on top of MapReduce; when run, Hive translates queries into a series of
MapReduce jobs. Scenarios for Hive are closer in concept to those for an RDBMS, so it is appropriate
for more structured data; for unstructured data, Pig is a better choice. The HDInsight
Service includes an ODBC driver for Hive, which provides direct real-time querying from BI tools such
as Microsoft Excel into Hadoop.
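A short HiveQL sketch of the SQL-like interface described above (the table name, columns, and HDFS location are illustrative assumptions):

```sql
-- Define a table over delimited files already sitting in HDFS (schema is hypothetical)
CREATE EXTERNAL TABLE weblogs (ip STRING, url STRING, bytes INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/weblogs';

-- A familiar aggregate query; Hive compiles it into MapReduce jobs
SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
FROM weblogs
GROUP BY url;
```

Because the syntax is a dialect of SQL, an analyst can issue this query from Excel or another BI tool through the ODBC driver without knowing anything about MapReduce.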
Other tools
Sqoop is a tool that transfers bulk data between Hadoop and structured data stores, such as SQL Server
or other relational databases, as efficiently as possible. Use Sqoop to import data from external structured
data stores into HDFS or related systems such as Hive. Sqoop can also extract data from Hadoop
and export it to external relational databases, enterprise data warehouses, or any other type of
structured data store.
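A hedged sketch of the two directions Sqoop supports; the JDBC connection string, table names, and directories below are illustrative assumptions:

```
# Import one table from a relational database into HDFS
# (host, database, credentials, and paths are hypothetical)
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=Sales" \
  --username etl_user --password-file /user/etl/.dbpass \
  --table Orders \
  --target-dir /data/sales/orders

# Export aggregated results from HDFS back into a relational table
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=Sales" \
  --username etl_user --password-file /user/etl/.dbpass \
  --table OrderStats \
  --export-dir /output/order_stats
```

Under the hood, Sqoop parallelizes each transfer as a MapReduce job, splitting the table across mappers for efficiency.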
Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and
moving large amounts of log data into HDFS. Flume's architecture is based on streaming data flows. It is
robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery
mechanisms. It has a simple, extensible data model that enables online analytical applications.
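Flume agents are wired together in a properties file as source, channel, and sink; a minimal sketch of one agent (the agent name, log path, and HDFS directory are illustrative assumptions):

```
# One agent (a1) tailing a log file and streaming events into HDFS
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a local log file (path is hypothetical)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver the buffered events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/app-logs
a1.sinks.k1.channel = c1
```

The channel is where the tunable reliability lives: a memory channel trades durability for speed, while a file-backed channel survives agent restarts.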
Mahout is an open-source machine-learning library that facilitates building scalable machine-
learning applications. Using the map/reduce paradigm, the clustering, classification, and batch-
based collaborative filtering algorithms developed for Mahout are implemented on top of Apache Hadoop.
What is NoSQL?
A NoSQL database provides a simple, lightweight mechanism for storing and retrieving data that
offers higher scalability and availability than traditional relational databases. NoSQL data
stores use looser consistency models to achieve horizontal scaling and higher availability.