Example 9-4. Java SQL imports
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the JavaSchemaRDD
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
Example 9-5. Python SQL imports
# Import Spark SQL
from pyspark.sql import HiveContext, Row
# Or if you can't include the hive requirements
from pyspark.sql import SQLContext, Row
Once we've added our imports, we need to create a HiveContext, or a SQLContext if we cannot bring in the Hive dependencies (see Examples 9-6 through 9-8). Both of these classes take a SparkContext to run on.
Example 9-6. Constructing a SQL context in Scala
val sc = new SparkContext(...)
val hiveCtx = new HiveContext(sc)
Example 9-7. Constructing a SQL context in Java
JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext sqlCtx = new HiveContext(ctx);
Example 9-8. Constructing a SQL context in Python
hiveCtx = HiveContext(sc)
Now that we have a HiveContext or SQLContext, we are ready to load our data and
query it.
Basic Query Example
To query a table, we call the sql() method on the HiveContext or SQLContext. First, though, we need to tell Spark SQL about some data to query. In this case we will load some Twitter data from JSON and give it a name by registering it as a “temporary table” so we can query it with SQL. (We will go over more details on loading in “Loading and Saving Data”.) Then we can select the top tweets by retweetCount. See Examples 9-9 through 9-11.
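As a rough illustration of the three steps just described (load JSON, register a temporary table, query it), here is a minimal Python sketch. It assumes the Spark 1.x API used in this chapter (jsonFile() and registerTempTable(), which later releases replaced with read.json() and createOrReplaceTempView()). The helper names top_tweets_query and load_and_query, the table name tweets, and the text field are illustrative assumptions, not part of Spark or of the data described above. The query orders by retweetCount descending so that the most-retweeted rows come first.

```python
def top_tweets_query(table="tweets", limit=10):
    """Build SQL selecting the most-retweeted rows.

    The `text` column and the default table name are assumptions
    made for illustration; only `retweetCount` comes from the text.
    """
    return (
        "SELECT text, retweetCount FROM {0} "
        "ORDER BY retweetCount DESC LIMIT {1}"
    ).format(table, int(limit))


def load_and_query(sc, input_path, table="tweets", limit=10):
    """Load JSON, register it as a temporary table, and query it.

    Assumes the Spark 1.x API; `sc` is an existing SparkContext and
    `input_path` points at a file with one JSON object per line.
    """
    # Imported here so the query helper above works without Spark installed.
    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    tweets = hive_ctx.jsonFile(input_path)  # schema is inferred from the JSON
    tweets.registerTempTable(table)         # give the data a name for SQL
    return hive_ctx.sql(top_tweets_query(table, limit))
```

Given a SparkContext sc and a placeholder input file such as tweets.json, calling load_and_query(sc, "tweets.json").collect() would bring the selected rows back to the driver. Keeping the query text in its own function is only a convenience for checking the SQL without starting Spark.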