Structured Data with Spark SQL
Spark SQL is a component added in Spark 1.0 that is quickly becoming Spark's preferred way to work with structured and semistructured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records. Spark SQL supports multiple structured data sources as input, and because it understands their schema, it can efficiently read only the fields you require from these data sources. We will cover Spark SQL in more detail in Chapter 9, but for now, we show how to use it to load data from a few common sources.
In all cases, we give Spark SQL a SQL query to run on the data source (selecting some fields or a function of the fields), and we get back an RDD of Row objects, one per record. In Java and Scala, the Row objects allow access based on the column number. Each Row has a get() method that gives back a general type we can cast, and specific get() methods for common basic types (e.g., getFloat(), getInt(), getLong(), getString(), getShort(), and getBoolean()). In Python we can just access the elements with row[column_number] and row.column_name.
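For instance, a short Java sketch of these accessors (illustrative only; it assumes a Row named row whose column 0 holds a string name and column 1 an integer age, as in the queries below):

import org.apache.spark.sql.Row;

// Illustrative sketch: row is a Row from a query result, with a string
// in column 0 and an integer in column 1.
String name = (String) row.get(0);  // generic accessor returns Object, so cast
String sameName = row.getString(0); // typed accessor for a common basic type
int age = row.getInt(1);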
Apache Hive
One common structured data source on Hadoop is Apache Hive. Hive can store
tables in a variety of formats, from plain text to column-oriented formats, inside
HDFS or other storage systems. Spark SQL can load any table supported by Hive.
To connect Spark SQL to an existing Hive installation, you need to provide a Hive
configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ direc‐
tory. Once you have done this, you create a HiveContext object, which is the entry
point to Spark SQL, and you can write Hive Query Language (HQL) queries against
your tables to get data back as RDDs of rows. Examples 5-30 through 5-32
demonstrate.
Example 5-30. Creating a HiveContext and selecting data in Python
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)  # sc is an existing SparkContext
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print(firstRow.name)  # access a field by column name
Example 5-31. Creating a HiveContext and selecting data in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc) // sc is an existing SparkContext
val rows = hiveCtx.sql("SELECT name, age FROM users")
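Example 5-32, the Java counterpart, follows the same pattern; the listing below is a sketch that assumes the Spark 1.x API, where sql() returns a SchemaRDD and HiveContext wraps an existing SparkContext (sc).

Example 5-32. Creating a HiveContext and selecting data in Java (sketch)

import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.hive.HiveContext;

HiveContext hiveCtx = new HiveContext(sc); // sc is an existing SparkContext
SchemaRDD rows = hiveCtx.sql("SELECT name, age FROM users");
Row firstRow = rows.first();
System.out.println(firstRow.getString(0)); // field 0 is the name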