Structured Data with Spark SQL
Spark SQL is a component added in Spark 1.0 that is quickly becoming Spark's preferred way to work with structured and semistructured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records. Spark SQL supports multiple structured data sources as input, and because it understands their schema, it can efficiently read only the fields you require from these data sources. We will cover Spark SQL in more detail in Chapter 9, but for now, we show how to use it to load data from a few common sources.
In all cases, we give Spark SQL a SQL query to run on the data source (selecting some fields or a function of the fields), and we get back an RDD of Row objects, one per record. In Java and Scala, the Row objects allow access based on the column number. Each Row has a get() method that gives back a general type we can cast, and specific get() methods for common basic types (e.g., getFloat(), getInt(), getLong(), getString(), getShort(), and getBoolean()). In Python we can simply access the elements with row[column_number] and row.column_name.
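To get a feel for the two Python access styles without a running Spark cluster, note that a Row behaves much like a named tuple: it supports both positional indexing and attribute access. The sketch below uses the standard library's namedtuple as a stand-in for Spark SQL's Row class (the Row fields "name" and "age" here are illustrative, matching the queries later in this section):

```python
from collections import namedtuple

# Stand-in for Spark SQL's Row: like a named tuple, a Row supports
# access both by column number and by column name.
Row = namedtuple("Row", ["name", "age"])

row = Row(name="Alice", age=30)

# Access by column number, as with row[column_number]
print(row[0])    # Alice
print(row[1])    # 30

# Access by column name, as with row.column_name
print(row.name)  # Alice
print(row.age)   # 30
```

Real Row objects returned by Spark SQL support both styles in exactly this way, which is why Python code rarely needs the typed get() accessors used in Java and Scala.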
Apache Hive
One common structured data source on Hadoop is Apache Hive. Hive can store
tables in a variety of formats, from plain text to column-oriented formats, inside
HDFS or other storage systems. Spark SQL can load any table supported by Hive.
To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ directory. Once you have done this, you create a HiveContext object, which is the entry point to Spark SQL, and you can write Hive Query Language (HQL) queries against your tables to get data back as RDDs of rows. Examples 5-30 through 5-32 demonstrate.
Example 5-30. Creating a HiveContext and selecting data in Python
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print firstRow.name
Example 5-31. Creating a HiveContext and selecting data in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")