Structured Data with Spark SQL
Spark SQL is a component added in Spark 1.0 that is quickly becoming Spark's preferred way to work with structured and semistructured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records. Spark SQL supports multiple structured data sources as input, and because it understands their schema, it can efficiently read only the fields you require from these data sources. We will cover Spark SQL in more detail in Chapter 9, but for now, we show how to use it to load data from a few common sources.
In all cases, we give Spark SQL a SQL query to run on the data source (selecting some fields or a function of the fields), and we get back an RDD of Row objects, one per record. In Java and Scala, the Row objects allow access based on the column number. Each Row has a get() method that gives back a general type we can cast, and specific get() methods for common basic types (e.g., getFloat(), getInt(), getLong(), getString(), getShort(), and getBoolean()). In Python we can simply access the elements with row[column_number] and row.column_name.
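To get a feel for the two Python access styles without a running Spark cluster, note that a Row behaves much like a named tuple: it supports both positional indexing and attribute access. The sketch below uses the standard library's namedtuple as a stand-in for Spark SQL's Row class (the Row fields "name" and "age" here are illustrative, matching the queries later in this section):

```python
from collections import namedtuple

# Stand-in for Spark SQL's Row: like a named tuple, a Row supports
# access both by column number and by column name.
Row = namedtuple("Row", ["name", "age"])

row = Row(name="Alice", age=30)

# Access by column number, as with row[column_number]
print(row[0])    # Alice
print(row[1])    # 30

# Access by column name, as with row.column_name
print(row.name)  # Alice
print(row.age)   # 30
```

Real Row objects returned by Spark SQL support both styles in exactly this way, which is why Python code rarely needs the typed get() accessors used in Java and Scala.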
Apache Hive
One common structured data source on Hadoop is Apache Hive. Hive can store
tables in a variety of formats, from plain text to column-oriented formats, inside
HDFS or other storage systems. Spark SQL can load any table supported by Hive.
To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ directory. Once you have done this, you create a HiveContext object, which is the entry point to Spark SQL, and you can write Hive Query Language (HQL) queries against your tables to get data back as RDDs of rows. Examples 5-30 through 5-32 demonstrate.
Example 5-30. Creating a HiveContext and selecting data in Python
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print firstRow.name
Example 5-31. Creating a HiveContext and selecting data in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")