Introduction to Data Analysis with Spark - Learning Spark


Figure 1-1. The Spark stack

Spark Core

Spark Core contains the basic functionality of Spark, including components for
task scheduling, memory management, fault recovery, interacting with storage
systems, and more. Spark Core is also home to the API that defines resilient
distributed datasets (RDDs), which are Spark's main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes that
can be manipulated in parallel. Spark Core provides many APIs for building and
manipulating these collections.
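The idea of a collection split across nodes and manipulated in parallel can be sketched in plain Python. This is a toy model only, not Spark's API: the helpers `partition` and `parallel_map` are hypothetical names invented for illustration, threads stand in for compute nodes, and nothing here handles scheduling or fault recovery. In PySpark, the rough counterparts would be `SparkContext.parallelize` and `RDD.map`.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split a list into contiguous, roughly equal chunks.

    Each chunk stands in for the slice of data held by one node.
    """
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_map(func, partitions):
    """Apply func to every element, using one worker per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


numbers = list(range(1, 11))
parts = partition(numbers, 3)
squared_parts = parallel_map(lambda x: x * x, parts)

# Combine the per-partition results into a single value.
total = sum(sum(part) for part in squared_parts)
print(parts)
print(total)
```

The point of the sketch is the shape of the computation: the transformation is written once per element, applied to each partition independently, and the partial results are combined at the end.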

Spark SQL

Spark SQL is Spark's package for working with structured data. It allows
querying data via SQL as well as the Apache Hive variant of SQL, called the Hive
Query Language (HQL), and it supports many sources of data, including Hive
tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL
allows developers to intermix SQL queries with the programmatic data
manipulations supported by RDDs in Python, Java, and Scala, all within a single
application, thus combining SQL with complex analytics. This tight integration
with the rich computing environment provided by Spark sets Spark SQL apart from
other open source data warehouse tools. Spark SQL was added to Spark in
version 1.0.
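The pattern of intermixing SQL with programmatic manipulation in one application can be illustrated without a cluster. The sketch below deliberately swaps in Python's built-in `sqlite3` module as a stand-in for Spark SQL, so it shows the pattern and not Spark's API; the `visits` table, its rows, and the `label` function are all made up for illustration.

```python
import sqlite3

# In-memory database standing in for a structured data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("ana", "/home", 12),
        ("ana", "/docs", 40),
        ("bo", "/home", 5),
        ("bo", "/docs", 75),
        ("cy", "/home", 8),
    ],
)

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT visitor, SUM(seconds) AS total "
    "FROM visits GROUP BY visitor ORDER BY visitor"
).fetchall()


# Programmatic step: arbitrary Python logic applied to the query result.
def label(total):
    return "heavy" if total >= 50 else "light"


labeled = [(visitor, total, label(total)) for visitor, total in rows]
conn.close()
print(labeled)
```

In Spark SQL the two steps operate on distributed data, and the result of a query can be passed on to further transformations in Python, Java, or Scala; here both steps run in a single local process.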

Shark was an older SQL-on-Spark project out of the University of California,
Berkeley, that modified Apache Hive to run on Spark. It has now been replaced by
Spark SQL to provide better integration with the Spark engine and language APIs.

Spark Streaming

Spark Streaming is a Spark component that enables processing of live streams of
data. Examples of data streams include logfiles generated by production web
servers, or queues of messages containing status updates posted by users of a
web service. Spark
