Loading and Saving Data
Spark SQL supports a number of structured data sources out of the box, letting you get Row objects from them without any complicated loading process. These sources include Hive tables, JSON, and Parquet files. In addition, if you query these sources using SQL and select only a subset of the fields, Spark SQL can smartly scan only the subset of the data for those fields, instead of scanning all the data the way a naive SparkContext.hadoopFile call might.
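For instance, here is a minimal sketch of such field pruning against a columnar Parquet file (the filename users.parquet and the name field are illustrative, not from this chapter's examples):

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
# Parquet stores data by column, so selecting a single field lets
# Spark SQL read only that column rather than whole records
users = hiveCtx.parquetFile("users.parquet")  # hypothetical input file
users.registerTempTable("users")
names = hiveCtx.sql("SELECT name FROM users")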
Apart from these data sources, you can also convert regular RDDs in your program
to SchemaRDDs by assigning them a schema. This makes it easy to write SQL queries
even when your underlying data is Python or Java objects. Often, SQL queries are
more concise when you're computing many quantities at once (e.g., if you wanted to
compute the average age, max age, and count of distinct user IDs in one pass). In
addition, you can easily join these RDDs with SchemaRDDs from any other Spark
SQL data source. In this section, we'll cover the external sources as well as this way of
using RDDs.
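As a hedged sketch of that pattern (the user records, field names, and values below are made up for illustration; inferSchema and registerTempTable are the Spark 1.x SchemaRDD APIs):

from pyspark.sql import HiveContext, Row

hiveCtx = HiveContext(sc)
# An ordinary RDD of Python Row objects, standing in for real user data
users = sc.parallelize([
    Row(id=1, age=23), Row(id=2, age=35), Row(id=2, age=35)])
# Assign a schema by inference, then register the result as a table
usersSchemaRDD = hiveCtx.inferSchema(users)
usersSchemaRDD.registerTempTable("users")
# Average age, max age, and distinct user count in a single pass
stats = hiveCtx.sql(
    "SELECT AVG(age), MAX(age), COUNT(DISTINCT id) FROM users")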
Apache Hive
When loading data from Hive, Spark SQL supports any Hive-supported storage formats (SerDes), including text files, RCFiles, ORC, Parquet, Avro, and Protocol Buffers.
To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ directory. If you just want to explore, Spark SQL will fall back to a local Hive metastore when no hive-site.xml is present, and we can easily load data into a Hive table to query later on.
Examples 9-15 through 9-17 illustrate querying a Hive table. Our example Hive table has two columns, key (which is an integer) and value (which is a string). We show how to create such a table later in this chapter.
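As a quick preview, one way to create and populate such a table is to run HiveQL statements through the HiveContext (the input path below follows the kv1.txt sample that ships with Spark's Hive examples; substitute your own file):

hiveCtx.sql(
    "CREATE TABLE IF NOT EXISTS mytable (key INT, value STRING)")
hiveCtx.sql("LOAD DATA LOCAL INPATH "
    "'examples/src/main/resources/kv1.txt' INTO TABLE mytable")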
Example 9-15. Hive load in Python
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT key, value FROM mytable")
keys = rows.map(lambda row: row[0])
Example 9-16. Hive load in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))