Figure 9-1. Spark SQL usage
In this chapter, we'll start by showing how to use SchemaRDDs inside regular Spark programs, to load and query structured data. We'll then describe the Spark SQL JDBC server, which lets you run Spark SQL on a shared server and connect either SQL shells or visualization tools like Tableau to it. Finally, we'll discuss some advanced features. Spark SQL is a newer component of Spark and it will evolve substantially in Spark 1.3 and future versions, so consult the most recent documentation for the latest information on Spark SQL and SchemaRDDs.
As we move through this chapter, we'll use Spark SQL to explore a JSON file with tweets. If you don't have any tweets on hand, you can use the Databricks reference application to download some, or you can use files/testweet.json in the book's Git repo.
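As a preview of what that exploration looks like, here is a minimal sketch in the Scala shell, assuming the Spark 1.2-era API (SQLContext and jsonFile; sc is the SparkContext the shell provides, and later releases replace SchemaRDD with DataFrame). The text and retweetCount field names are assumptions about the sample file's schema:

import org.apache.spark.sql.SQLContext

// Wrap the shell's existing SparkContext in a SQLContext
val sqlContext = new SQLContext(sc)

// Load the tweets as a SchemaRDD; the schema is inferred from the JSON records
val tweets = sqlContext.jsonFile("files/testweet.json")

// Register the SchemaRDD as a temporary table so SQL queries can see it
tweets.registerTempTable("tweets")

// Pull out the text of the most retweeted tweets
val top = sqlContext.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10")
top.collect().foreach(println)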
Linking with Spark SQL
As with the other Spark libraries, including Spark SQL in our application requires
some additional dependencies. This allows Spark Core to be built without depending
on a large number of additional packages.
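As a concrete illustration, in an sbt build this comes down to one extra dependency. The coordinates below are representative of the Spark 1.2 era, so match the version to your cluster; swap in spark-sql instead of spark-hive for a build without Hive support (the distinction is explained next):

// build.sbt -- link the application against Spark SQL with Hive support
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-hive" % "1.2.0"
)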
Spark SQL can be built with or without Apache Hive, the Hadoop SQL engine. Spark SQL with Hive support allows us to access Hive tables, UDFs (user-defined functions), SerDes (serialization and deserialization formats), and the Hive query language (HiveQL). It is important to note that including the Hive libraries does not require an existing Hive installation. In general, it is best to build Spark SQL with Hive support to access these features. If you download Spark in binary form, it should already be built with Hive support. If you are building Spark from source, you should run sbt/sbt -Phive assembly.
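With the Hive-enabled build linked in, the entry point inside a standalone program is a HiveContext; a plain SQLContext is the fallback when Hive support is left out. A minimal sketch, again assuming Spark 1.2-era APIs (the application name is our own placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkSQLTweets")
val sc = new SparkContext(conf)

// HiveContext gives us HiveQL, Hive UDFs, and SerDes;
// it does not need a running Hive installation
val hiveCtx = new HiveContext(sc)

// If Spark was built without Hive, use the plain SQLContext instead:
// val sqlCtx = new org.apache.spark.sql.SQLContext(sc)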
 