Figure 9-1. Spark SQL usage
In this chapter, we'll start by showing how to use SchemaRDDs inside regular Spark programs, to load and query structured data. We'll then describe the Spark SQL JDBC server, which lets you run Spark SQL on a shared server and connect either SQL shells or visualization tools like Tableau to it. Finally, we'll discuss some advanced features. Spark SQL is a newer component of Spark and it will evolve substantially in Spark 1.3 and future versions, so consult the most recent documentation for the latest information on Spark SQL and SchemaRDDs.
As we move through this chapter, we'll use Spark SQL to explore a JSON file with tweets. If you don't have any tweets on hand, you can use the Databricks reference application to download some, or you can use files/testweet.json in the book's Git repo.
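As a preview of what that exploration looks like, here is a minimal sketch in the Scala shell, assuming the Spark 1.2-era API (SQLContext and jsonFile; sc is the SparkContext the shell provides, and later releases replace SchemaRDD with DataFrame). The text and retweetCount field names are assumptions about the sample file's schema:

import org.apache.spark.sql.SQLContext

// Wrap the shell's existing SparkContext in a SQLContext
val sqlContext = new SQLContext(sc)

// Load the tweets as a SchemaRDD; the schema is inferred from the JSON records
val tweets = sqlContext.jsonFile("files/testweet.json")

// Register the SchemaRDD as a temporary table so SQL queries can see it
tweets.registerTempTable("tweets")

// Pull out the text of the most retweeted tweets
val top = sqlContext.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10")
top.collect().foreach(println)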
Linking with Spark SQL
As with the other Spark libraries, including Spark SQL in our application requires
some additional dependencies. This allows Spark Core to be built without depending
on a large number of additional packages.
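As a concrete illustration, in an sbt build this comes down to one extra dependency. The coordinates below are representative of the Spark 1.2 era, so match the version to your cluster; swap in spark-sql instead of spark-hive for a build without Hive support (the distinction is explained next):

// build.sbt -- link the application against Spark SQL with Hive support
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-hive" % "1.2.0"
)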
Spark SQL can be built with or without Apache Hive, the Hadoop SQL engine. Spark SQL with Hive support allows us to access Hive tables, UDFs (user-defined functions), SerDes (serialization and deserialization formats), and the Hive query language (HiveQL). It is important to note that including the Hive libraries does not require an existing Hive installation. In general, it is best to build Spark SQL with Hive support to access these features. If you download Spark in binary form, it should already be built with Hive support. If you are building Spark from source, you should run sbt/sbt -Phive assembly.
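With the Hive-enabled build linked in, the entry point inside a standalone program is a HiveContext; a plain SQLContext is the fallback when Hive support is left out. A minimal sketch, again assuming Spark 1.2-era APIs (the application name is our own placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkSQLTweets")
val sc = new SparkContext(conf)

// HiveContext gives us HiveQL, Hive UDFs, and SerDes;
// it does not need a running Hive installation
val hiveCtx = new HiveContext(sc)

// If Spark was built without Hive, use the plain SQLContext instead:
// val sqlCtx = new org.apache.spark.sql.SQLContext(sc)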
 