If you have dependency conflicts with Hive that you cannot solve through exclusions
or shading, you can also build and link to Spark SQL without Hive. In that case you
link to a separate Maven artifact.
In Java and Scala, the Maven coordinates to link to Spark SQL with Hive are shown
in Example 9-1.
Example 9-1. Maven coordinates for Spark SQL with Hive support
groupId = org.apache.spark
artifactId = spark-hive_2.10
version = 1.2.0
If you can't include the Hive dependencies, use the artifact ID spark-sql_2.10
instead of spark-hive_2.10.
As with the other Spark libraries, in Python no changes to your build are required.
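For Scala projects built with sbt, the coordinates from Example 9-1 would map to a dependency line along the following lines (a sketch; the Scala suffix and Spark version must match your own build):

// Link against Spark SQL with Hive support; swap in spark-sql_2.10 if the
// Hive dependencies cannot be included.
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"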
When programming against Spark SQL we have two entry points depending on
whether we need Hive support. The recommended entry point is the HiveContext, which
provides access to HiveQL and other Hive-dependent functionality. The more basic
SQLContext provides a subset of the Spark SQL support that does not depend on
Hive. The separation exists for users who might have conflicts with including all of
the Hive dependencies. Using a HiveContext does not require an existing Hive setup.
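As a rough sketch in Scala against the Spark 1.2 API (the application name below is a made-up placeholder), both entry points are constructed directly from a SparkContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkSQLExample")  // placeholder app name
val sc = new SparkContext(conf)

// Recommended entry point: HiveQL and other Hive-dependent functionality,
// with no existing Hive installation required.
val hiveCtx = new HiveContext(sc)

// Hive-free alternative: the subset of Spark SQL without the Hive dependencies.
val sqlCtx = new SQLContext(sc)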
HiveQL is the recommended query language for working with Spark SQL. Many
resources have been written on HiveQL, including Programming Hive and the online
Hive Language Manual. In Spark 1.0 and 1.1, Spark SQL is based on Hive 0.12,
whereas in Spark 1.2 it also supports Hive 0.13. If you already know standard SQL,
using HiveQL should feel very similar.
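For instance, a simple aggregation issued through the hiveCtx created above reads much like ordinary SQL (the logs table and its columns are hypothetical, purely for illustration):

// In Spark 1.2, sql() on a HiveContext parses HiveQL and returns a SchemaRDD.
val topUsers = hiveCtx.sql(
  "SELECT name, COUNT(*) AS visits FROM logs WHERE status = 200 " +
  "GROUP BY name ORDER BY visits DESC LIMIT 10")
topUsers.collect().foreach(println)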
Spark SQL is a newer and fast-moving component of Spark. The set
of compatible Hive versions may change in the future, so consult
the most recent documentation for more details.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your
hive-site.xml file to Spark's configuration directory ($SPARK_HOME/conf). If you
don't have an existing Hive installation, Spark SQL will still run.
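Once hive-site.xml is in place, a HiveContext created as above should see the tables registered in your existing metastore; one quick way to check is to list them:

// With hive-site.xml in $SPARK_HOME/conf, the HiveContext uses the existing
// metastore, so previously created Hive tables become visible.
hiveCtx.sql("SHOW TABLES").collect().foreach(println)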