CHAPTER 9
Spark SQL
This chapter introduces Spark SQL, Spark's interface for working with structured and semistructured data. Structured data is any data that has a schema—that is, a known set of fields for each record. When you have this type of data, Spark SQL makes it both easier and more efficient to load and query. In particular, Spark SQL provides three main capabilities (illustrated in Figure 9-1, and sketched in code after this list):
1. It can load data from a variety of structured sources (e.g., JSON, Hive, and
Parquet).
2. It lets you query the data using SQL, both inside a Spark program and from
external tools that connect to Spark SQL through standard database connectors
(JDBC/ODBC), such as business intelligence tools like Tableau.
3. When used within a Spark program, Spark SQL provides rich integration
between SQL and regular Python/Java/Scala code, including the ability to join
RDDs and SQL tables, expose custom functions in SQL, and more. Many jobs are
easier to write using this combination.
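To make this concrete, here is a minimal sketch of the first two capabilities in Scala, written against the Spark 1.x (SchemaRDD-era) API this chapter covers; the input file people.json and its name and age fields are assumptions for the example:

```scala
// A minimal sketch of capabilities 1 and 2, using the Spark 1.x
// (SchemaRDD-era) API described in this chapter. The input file
// people.json and its name/age fields are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("SparkSQLIntro").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Capability 1: load structured data; the schema is inferred from the records.
val people = sqlContext.jsonFile("people.json")

// Capability 2: register the data as a temporary table and query it with SQL.
people.registerTempTable("people")
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The result is itself a SchemaRDD of Row objects, so regular RDD
// operations apply (capability 3).
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
```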
To implement these capabilities, Spark SQL provides a special type of RDD called
SchemaRDD. A SchemaRDD is an RDD of Row objects, each representing a record. A
SchemaRDD also knows the schema (i.e., data fields) of its rows. While SchemaRDDs
look like regular RDDs, internally they store data in a more efficient manner, taking
advantage of their schema. In addition, they provide new operations not available on
RDDs, such as the ability to run SQL queries. SchemaRDDs can be created from
external data sources, from the results of queries, or from regular RDDs.
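As a sketch of that last point, the following Scala snippet (continuing with the sc and sqlContext from the sketch above) builds a SchemaRDD from a regular RDD of case class objects and queries it with SQL; the Person class and its records are illustrative:

```scala
// Hypothetical record type; Spark SQL infers the schema from the case
// class fields via reflection.
case class Person(name: String, age: Int)

val peopleRDD = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 19)))

// This import brings in the implicit conversion that turns an RDD of
// case class objects into a SchemaRDD.
import sqlContext.createSchemaRDD

// Register the (implicitly converted) RDD as a table and query it.
peopleRDD.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.collect().foreach(println)
```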
 