Introducing Big Data Technologies - Data Warehousing in the Age of Big Data

Databases Reference

In-Depth Information

failover. To manage this, the client session ID information is associated with child zNodes and locks,

which will enable the failover client to synchronize.

Zookeeper is a highly available system, and it is critical that it can perform its functions in a

timely manner. It is recommended to run Zookeeper on dedicated machines. Running it in a shared-

services environment will adversely impact performance.

In Hadoop deployment, Zookeeper serves as the coordinator for managing all the key activities:

●

Manage configuration across nodes . Zookeeper helps you quickly push configuration changes

across dozens or hundreds of nodes.

●

Implement reliable messaging. A guaranteed messaging architecture to deliver messages can be

implemented with Zookeeper.

●

Implement redundant services. Managing a large number of nodes with a Zab approach will

provide a scalable redundancy management solution.

●

Synchronize process execution. With Zookeeper, multiple nodes can coordinate the start and end

of a process or calculation. This approach can ensure consistency of completion of operations.

Please see the Zookeeper configuration and administrators guide for further details

( http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html ).

Pig

Analyzing large data sets introduces dataflow complexities that become harder to implement in a

MapReduce program as data volumes and processing complexities increase. A high-level language

that is more user friendly, is SQL-like in terms of expressing dataflows, has the flexibility to manage

multistep data transformations, and handles joins with simplicity and easy program flow, was needed

as an abstraction layer over MapReduce.

Apache Pig is a platform that has been designed and developed for analyzing large data sets. Pig

consists of a high-level language for expressing data analysis programs and comes with infrastructure

for evaluating these programs. At the time of writing, Pig's current infrastructure consists of a com-

piler that produces sequences of MapReduce programs. Pig's language architecture is a textual lan-

guage platform called Pig Latin, of which the design goals were based on the requirement to handle

large data processing with minimal complexity and include:

●

Programming flexibility. . The ability to break down complex tasks comprised of multiple steps and

interprocess-related data transformations should be encoded as dataflow sequences that are easy to

design, develop, and maintain.

●

Automatic optimization . Tasks are encoded to let the system optimize their execution

automatically. This allows the user to have greater focus on program development, allowing the

user to focus on semantics rather than efficiency.

●

Extensibility. Users can develop user-defined functions (UDFs) for more complex processing

requirements.

Programming with pig latin

Pig is primarily a scripting language for exploring large data sets. It is developed to process multiple

terabytes of data in a half-dozen lines of Pig Latin code. Pig provides several commands to the devel-

oper for introspecting the data structures in the program, as it is written.

Data Warehousing in the Age of Big Data

Search WWH ::

Custom Search

Home