Loading and Saving Data
Spark SQL supports a number of structured data sources out of the box, letting you get Row objects from them without any complicated loading process. These sources include Hive tables, JSON, and Parquet files. In addition, if you query these sources using SQL and select only a subset of the fields, Spark SQL can smartly scan only the subset of the data for those fields, instead of scanning all the data the way a naive SparkContext.hadoopFile call might.
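For instance, here is a minimal sketch of such field pruning against a columnar Parquet file (the filename users.parquet and the name field are illustrative, not from this chapter's examples):

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
# Parquet stores data by column, so selecting a single field lets
# Spark SQL read only that column rather than whole records
users = hiveCtx.parquetFile("users.parquet")  # hypothetical input file
users.registerTempTable("users")
names = hiveCtx.sql("SELECT name FROM users")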
Apart from these data sources, you can also convert regular RDDs in your program
to SchemaRDDs by assigning them a schema. This makes it easy to write SQL queries
even when your underlying data is Python or Java objects. Often, SQL queries are
more concise when you're computing many quantities at once (e.g., if you wanted to
compute the average age, max age, and count of distinct user IDs in one pass). In
addition, you can easily join these RDDs with SchemaRDDs from any other Spark
SQL data source. In this section, we'll cover the external sources as well as this way of
using RDDs.
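As a hedged sketch of that pattern (the user records, field names, and values below are made up for illustration; inferSchema and registerTempTable are the Spark 1.x SchemaRDD APIs):

from pyspark.sql import HiveContext, Row

hiveCtx = HiveContext(sc)
# An ordinary RDD of Python Row objects, standing in for real user data
users = sc.parallelize([
    Row(id=1, age=23), Row(id=2, age=35), Row(id=2, age=35)])
# Assign a schema by inference, then register the result as a table
usersSchemaRDD = hiveCtx.inferSchema(users)
usersSchemaRDD.registerTempTable("users")
# Average age, max age, and distinct user count in a single pass
stats = hiveCtx.sql(
    "SELECT AVG(age), MAX(age), COUNT(DISTINCT id) FROM users")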
Apache Hive
When loading data from Hive, Spark SQL supports any Hive-supported storage formats (SerDes), including text files, RCFiles, ORC, Parquet, Avro, and Protocol Buffers.
To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ directory. If you just want to explore, Spark SQL will fall back to a local Hive metastore when no hive-site.xml is present, and we can easily load data into a Hive table to query later on.
Examples 9-15 through 9-17 illustrate querying a Hive table. Our example Hive table has two columns, key (which is an integer) and value (which is a string). We show how to create such a table later in this chapter.
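As a quick preview, one way to create and populate such a table is to run HiveQL statements through the HiveContext (the input path below follows the kv1.txt sample that ships with Spark's Hive examples; substitute your own file):

hiveCtx.sql(
    "CREATE TABLE IF NOT EXISTS mytable (key INT, value STRING)")
hiveCtx.sql("LOAD DATA LOCAL INPATH "
    "'examples/src/main/resources/kv1.txt' INTO TABLE mytable")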
Example 9-15. Hive load in Python
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT key, value FROM mytable")
keys = rows.map(lambda row: row[0])
Example 9-16. Hive load in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))