Example 9-9. Loading and querying tweets in Scala
val input = hiveCtx.jsonFile(inputFile)
// Register the input schema RDD
input.registerTempTable("tweets")
// Select tweets based on the retweetCount
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
Example 9-10. Loading and querying tweets in Java
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10");
Example 9-11. Loading and querying tweets in Python
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM
  tweets ORDER BY retweetCount LIMIT 10""")
If you have an existing Hive installation, and have copied your hive-site.xml file to $SPARK_HOME/conf, you can also just run hiveCtx.sql to query your existing Hive tables.
SchemaRDDs
Both loading data and executing queries return SchemaRDDs. SchemaRDDs are similar to tables in a traditional database. Under the hood, a SchemaRDD is an RDD composed of Row objects with additional schema information of the types in each column. Row objects are just wrappers around arrays of basic types (e.g., integers and strings), and we'll cover them in more detail in the next section.
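Conceptually, a Row pairs a plain tuple of values with a schema that maps field names to positions. The following is a rough pure-Python sketch of that idea, not Spark's actual Row implementation, and the tweet field values are hypothetical:

```python
class Row(tuple):
    """Sketch of a schema-aware row: an ordinary tuple plus a
    field-name -> position map (NOT Spark's real Row class)."""

    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        row._schema = {name: i for i, name in enumerate(fields)}
        return row

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails;
        # fall back to positional lookup through the schema map.
        schema = self.__dict__.get("_schema", {})
        if name in schema:
            return self[schema[name]]
        raise AttributeError(name)

# A hypothetical row matching the query above.
tweet = Row(["text", "retweetCount"], ["look at this tweet!", 5])
print(tweet.text)   # field access by name
print(tweet[1])     # ordinary tuple indexing still works
```

Because the sketch subclasses tuple, a row stays immutable and supports the positional access that basic RDD transformations expect, while the schema map adds access by column name on top.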
One important note: in future versions of Spark, the name SchemaRDD may be changed to DataFrame. This renaming was still under discussion as this book went to print.
SchemaRDDs are also regular RDDs, so you can operate on them using existing RDD transformations like map() and filter(). However, they provide several additional capabilities. Most importantly, you can register any SchemaRDD as a temporary table