If you have dependency conflicts with Hive that you cannot solve through exclusions
or shading, you can also build and link to Spark SQL without Hive. In that case you
link to a separate Maven artifact.
In Java and Scala, the Maven coordinates to link to Spark SQL with Hive are shown
in Example 9-1.
Example 9-1. Maven coordinates for Spark SQL with Hive support
groupId = org.apache.spark
artifactId = spark-hive_2.10
version = 1.2.0
If you can't include the Hive dependencies, use the artifact ID spark-sql_2.10
instead of spark-hive_2.10.
As with the other Spark libraries, in Python no changes to your build are required.
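For Scala projects built with sbt, the coordinates from Example 9-1 would map to a dependency line along the following lines (a sketch; the Scala suffix and Spark version must match your own build):

// Link against Spark SQL with Hive support; swap in spark-sql_2.10 if the
// Hive dependencies cannot be included.
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"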
When programming against Spark SQL we have two entry points depending on
whether we need Hive support. The recommended entry point is the HiveContext, which
provides access to HiveQL and other Hive-dependent functionality. The more basic
SQLContext provides a subset of the Spark SQL support that does not depend on
Hive. The separation exists for users who might have conflicts with including all of
the Hive dependencies. Using a HiveContext does not require an existing Hive setup.
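As a rough sketch in Scala against the Spark 1.2 API (the application name below is a made-up placeholder), both entry points are constructed directly from a SparkContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkSQLExample")  // placeholder app name
val sc = new SparkContext(conf)

// Recommended entry point: HiveQL and other Hive-dependent functionality,
// with no existing Hive installation required.
val hiveCtx = new HiveContext(sc)

// Hive-free alternative: the subset of Spark SQL without the Hive dependencies.
val sqlCtx = new SQLContext(sc)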
HiveQL is the recommended query language for working with Spark SQL. Many
resources have been written on HiveQL, including Programming Hive and the online
Hive Language Manual. In Spark 1.0 and 1.1, Spark SQL is based on Hive 0.12,
whereas in Spark 1.2 it also supports Hive 0.13. If you already know standard SQL,
using HiveQL should feel very similar.
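For instance, a simple aggregation issued through the hiveCtx created above reads much like ordinary SQL (the logs table and its columns are hypothetical, purely for illustration):

// In Spark 1.2, sql() on a HiveContext parses HiveQL and returns a SchemaRDD.
val topUsers = hiveCtx.sql(
  "SELECT name, COUNT(*) AS visits FROM logs WHERE status = 200 " +
  "GROUP BY name ORDER BY visits DESC LIMIT 10")
topUsers.collect().foreach(println)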
Spark SQL is a newer and fast-moving component of Spark. The set
of compatible Hive versions may change in the future, so consult
the most recent documentation for more details.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your
hive-site.xml file to Spark's configuration directory ($SPARK_HOME/conf). If you
don't have an existing Hive installation, Spark SQL will still run.
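Once hive-site.xml is in place, a HiveContext created as above should see the tables registered in your existing metastore; one quick way to check is to list them:

// With hive-site.xml in $SPARK_HOME/conf, the HiveContext uses the existing
// metastore, so previously created Hive tables become visible.
hiveCtx.sql("SHOW TABLES").collect().foreach(println)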