in memory as possible, using the disks as infrequently as possible. But abandoning MapReduce
has made Spark SQL much faster, even when it does require disk access.
While Spark SQL speaks HQL, the Hive query language, it has a few extra features that
aren't in Hive. One is the ability to cache table data for the duration of a user session. This
corresponds to temporary tables in many other databases, but unlike in other databases, these
tables live in memory and are thus accessed much faster. Spark SQL also allows access to
tables as though they were Spark Resilient Distributed Datasets (RDDs).
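As a rough sketch of the caching and RDD access described above, the following assumes a
recent PySpark release in which a SparkSession object (passed in here as spark) takes the
place of the older SQLContext. The function names cache_view and top_rated, and the rating
column, are hypothetical and chosen only for illustration.

```python
def cache_view(spark, df, view_name):
    """Register df as a session-scoped view and keep its data in memory."""
    # A temporary view only lasts as long as the session that created it.
    df.createOrReplaceTempView(view_name)
    # CACHE TABLE is eager by default, so the data is loaded into memory now.
    spark.sql(f"CACHE TABLE {view_name}")
    return spark.table(view_name)


def top_rated(table_df, minimum):
    """Treat the table as an RDD of rows and filter it with plain Python."""
    # .rdd exposes the table's rows as a Resilient Distributed Dataset.
    return table_df.rdd.filter(lambda row: row["rating"] >= minimum).collect()
```

With a live session and an already loaded DataFrame, cache_view(spark, df, "reviews") would
return the cached table, which top_rated could then filter like any other RDD.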
Spark SQL supports the Hive metastore, most of its query language, and data formats, so
existing Hive users should have an easier time converting to Spark SQL than many others. However,
while the Spark SQL documentation is currently not absolutely clear on this, not all the Hive
features have yet been implemented in Spark SQL. APIs currently exist for Python, Java, and
Scala. See “Hive” for more details. Spark SQL can also run Spark's MLlib machine-learning
algorithms as SQL statements.
Spark SQL can use JSON (described here) and Parquet (described here) as data sources, so
it's pretty useful in an HDFS environment.
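A minimal sketch of reading either kind of source, again assuming a recent PySpark release
with a SparkSession passed in as spark. The helper name load_reviews and the idea of
choosing the reader by file extension are illustrative assumptions, not something Spark SQL
requires.

```python
def load_reviews(spark, path):
    """Return a DataFrame for a JSON or Parquet file, chosen by file extension."""
    if path.endswith(".parquet"):
        # Parquet files store their own schema alongside the data.
        return spark.read.parquet(path)
    if path.endswith(".json"):
        # By default the JSON reader expects one JSON object per line.
        return spark.read.json(path)
    raise ValueError(f"unsupported data source: {path}")
```

The path can point at HDFS as well as the local filesystem, since the reader accepts any
URI that the cluster's Hadoop configuration understands.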
Tutorial Links
There is a wealth of tutorials on the project home page.
Example Code
At the user level, Spark SQL looks like Hive, so if you can code in Hive, you can almost code in
Spark SQL. But you need to set up your Spark SQL environment. Here's how you would do
it in Python using the movie review data we use in other examples (to understand the setup,
you'll need to read “Spark”, as well as have some knowledge of Python):
# Spark requires a Context object. Let's assume it exists
# already (as sc). You need a SQL Context object as well.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Load the CSV text file and convert each line to a Python
# dictionary using lambda notation for anonymous functions.
lines = sc.textFile("reviews.csv")
movies = lines.map(lambda l: l.split(","))
reviews = movies.map(
    lambda p: {"name": p[0], "title": p[1], "rating": int(p[2])})
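To see what the two map steps produce for each line, the same logic can be tried in plain
Python without a cluster. This is only a sketch: the two sample lines are made up for
illustration and are not taken from the actual reviews.csv, and parse_review is a
hypothetical helper name.

```python
# Made-up sample lines standing in for reviews.csv (not real data).
sample_lines = [
    "alice,Casablanca,5",
    "bob,Vertigo,4",
]


def parse_review(line):
    """Mirror the two map() steps: split on commas, then build a dictionary."""
    p = line.split(",")
    return {"name": p[0], "title": p[1], "rating": int(p[2])}


reviews = [parse_review(line) for line in sample_lines]
print(reviews[0])  # {'name': 'alice', 'title': 'Casablanca', 'rating': 5}
```

Note that a plain split on commas would misparse any title that itself contains a comma;
Python's csv module handles quoted fields if the data needs it.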