Pig
Pig is a high-level platform for processing Big Data on Hadoop clusters. Pig consists of a data flow
language, called Pig Latin, for writing queries over large datasets, and an execution environment
that runs Pig Latin programs from a console. A Pig Latin program is a series of dataset transformations
that are converted under the hood into a series of MapReduce jobs. Pig Latin abstractions provide richer
data structures than raw MapReduce; they do for Hadoop what SQL does for RDBMS systems.
Pig Latin is fully extensible: User Defined Functions (UDFs), written in Java, Python, C#, or JavaScript,
can be called to customize each stage of the processing path when composing an analysis.
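A minimal Pig Latin sketch of the transformation-series style described above (the file path, field names, and output directory are illustrative assumptions):

```pig
-- Load a tab-separated log file (path and schema are hypothetical)
logs   = LOAD '/data/weblogs.tsv' AS (ip:chararray, url:chararray, bytes:int);
-- Group the records by URL
by_url = GROUP logs BY url;
-- Count hits and sum bytes per URL
stats  = FOREACH by_url GENERATE group AS url,
                                 COUNT(logs) AS hits,
                                 SUM(logs.bytes) AS total_bytes;
-- Write the result back into HDFS
STORE stats INTO '/output/url_stats';
```

When run, Pig compiles this series of statements into the underlying MapReduce job series; the author never writes map or reduce functions directly.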
Hive
Hive is the glue between the world of Hadoop and the world of BI: in effect, it can make Hadoop
look like just another relational data source. Hive is aimed at analysts with strong SQL skills, providing
an SQL-like interface and a relational data model. Hive uses a language called HiveQL, a dialect of SQL.
Like Pig, Hive is an abstraction on top of MapReduce; when run, Hive translates queries into a series of
MapReduce jobs. Scenarios for Hive are closer in concept to those for an RDBMS, so it is appropriate
for more structured data; for unstructured data, Pig is a better choice. The HDInsight
Service includes an ODBC driver for Hive, which provides direct real-time querying from BI tools such
as Microsoft Excel into Hadoop.
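A short HiveQL sketch of the SQL-like interface described above (the table name, columns, and HDFS location are illustrative assumptions):

```sql
-- Define a table over delimited files already sitting in HDFS (schema is hypothetical)
CREATE EXTERNAL TABLE weblogs (ip STRING, url STRING, bytes INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/weblogs';

-- A familiar aggregate query; Hive compiles it into MapReduce jobs
SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
FROM weblogs
GROUP BY url;
```

Because the syntax is a dialect of SQL, an analyst can issue this query from Excel or another BI tool through the ODBC driver without knowing anything about MapReduce.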
Other tools
Sqoop is a tool that transfers bulk data between Hadoop and structured data stores, such as SQL Server
or other relational databases, as efficiently as possible. Use Sqoop to import data from external structured
data stores into HDFS or related systems such as Hive. Sqoop can also extract data from Hadoop
and export it to external relational databases, enterprise data warehouses, or any other type of
structured data store.
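A hedged sketch of the two directions Sqoop supports; the JDBC connection string, table names, and directories below are illustrative assumptions:

```
# Import one table from a relational database into HDFS
# (host, database, credentials, and paths are hypothetical)
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=Sales" \
  --username etl_user --password-file /user/etl/.dbpass \
  --table Orders \
  --target-dir /data/sales/orders

# Export aggregated results from HDFS back into a relational table
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=Sales" \
  --username etl_user --password-file /user/etl/.dbpass \
  --table OrderStats \
  --export-dir /output/order_stats
```

Under the hood, Sqoop parallelizes each transfer as a MapReduce job, splitting the table across mappers for efficiency.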
Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and
moving large amounts of log data into HDFS. Flume's architecture is based on streaming data flows. It is
robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery
mechanisms. It has a simple, extensible data model that enables online analytical applications.
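Flume agents are wired together in a properties file as source, channel, and sink; a minimal sketch of one agent (the agent name, log path, and HDFS directory are illustrative assumptions):

```
# One agent (a1) tailing a log file and streaming events into HDFS
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a local log file (path is hypothetical)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver the buffered events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/app-logs
a1.sinks.k1.channel = c1
```

The channel is where the tunable reliability lives: a memory channel trades durability for speed, while a file-backed channel survives agent restarts.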
Mahout is an open-source machine-learning library that facilitates building scalable machine-
learning applications. Using the map/reduce paradigm, the clustering, classification, and batch-
based collaborative filtering algorithms developed for Mahout are implemented on top of Apache Hadoop.
What is NoSQL?
A NoSQL database provides a simple, lightweight mechanism for storing and retrieving data that
offers higher scalability and availability than traditional relational databases. NoSQL data
stores use looser consistency models to achieve horizontal scaling and higher availability.