Database and Data Management - Field Guide to Hadoop

Database Reference

In-Depth Information

in memory as possible, using the disks as infrequently as possible. But abandoning MapRe-

duce has made Spark SQL much faster, even if it requires disk access.

While Spark SQL speaks HQL, the Hive query language, it has a few extra features that

aren't in Hive. One is the ability to encache table data for the duration of a user session. This

corresponds to temporary tables in many other databases, but unlike other databases, these

tables live in memory and are thus accessed much faster. Spark SQL also allows access to

tables as though they were Spark Resilient Distributed Datasets (RDD).

Spark SQL supports the Hive metastore, most of its query language, and data formats, so ex-

isting Hive users should have an easier time converting to Shark than many others. However,

while the Spark SQL documentation is currently not absolutely clear on this, not all the Hive

features have yet been implemented in Spark SQL. APIs currently exist for Python, Java, and

Scala. See “Hive” for more details. Spark SQL also can run Spark's MLlib machine-learning

algorithms as SQL statements.

Spark SQL can use JSON (described here ) and Parquet (described here ) as data sources, so

it's pretty useful in an HDFS environment.

Tutorial Links

There are a wealth of tutorials on the project home page .

Example Code

At the user level, Shark looks like Hive, so if you can code in Hive, you can almost code in

Spark SQL. But you need to set up your Spark SQL environment. Here's how you would do

it in Python using the movie review data we use in other examples (to understand the setup,

you'll need to read “Spark” , as well as have some knowledge of Python):

# Spark requires a Context object. Let's assume it exists

# already. You need a SQL Context object as well

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Load a the CSV text file and convert each line to a Python

# dictionary using lambda notation for anonymous functions.

lines = sc.textFile("reviews.csv")

movies = lines.map(lambda l: l.split(","))

reviews = movies.map(

lambda p: {"name": p[0], "title": p[1], "rating": int(p[2])})

Search WWH ::

Custom Search

Home