Example 9-9. Loading and querying tweets in Scala
val input = hiveCtx.jsonFile(inputFile)
// Register the input schema RDD
input.registerTempTable("tweets")
// Select tweets based on the retweetCount
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
Example 9-10. Loading and querying tweets in Java
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10");
Example 9-11. Loading and querying tweets in Python
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM
  tweets ORDER BY retweetCount LIMIT 10""")
If you have an existing Hive installation, and have copied your
hive-site.xml file to $SPARK_HOME/conf , you can also just run
hiveCtx.sql to query your existing Hive tables.
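For instance, a minimal sketch of querying an existing Hive table might look like the following; the table name users and its columns name and age are hypothetical placeholders for whatever your metastore actually contains:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
// A sketch: assumes hive-site.xml is in $SPARK_HOME/conf and a Hive
// table named "users" (hypothetical) already exists in the metastore
val conf = new SparkConf().setAppName("HiveQuerySketch")
val sc = new SparkContext(conf)
val hiveCtx = new HiveContext(sc)
// Query the existing Hive table the same way as the temporary tables above
val users = hiveCtx.sql("SELECT name, age FROM users LIMIT 10")
users.collect().foreach(println)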
SchemaRDDs
Both loading data and executing queries return SchemaRDDs. SchemaRDDs are similar to tables in a traditional database. Under the hood, a SchemaRDD is an RDD composed of Row objects with additional schema information about the types in each column. Row objects are just wrappers around arrays of basic types (e.g., integers and strings), and we'll cover them in more detail in the next section.
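As a quick preview, a sketch of pulling individual fields out of the Row objects in topTweets from the examples above; the column positions follow the SELECT clause, so 0 is text and 1 is retweetCount:
// A sketch reusing the topTweets SchemaRDD from Example 9-9
val texts = topTweets.map(row => row.getString(0)) // typed accessor
val pairs = topTweets.map(row => (row(0), row(1))) // generic access returns Any
texts.collect().foreach(println)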
One important note: in future versions of Spark, the name SchemaRDD may be changed to DataFrame. This renaming was still under discussion as this book went to print.
SchemaRDDs are also regular RDDs, so you can operate on them using existing RDD transformations like map() and filter(). However, they provide several additional capabilities. Most importantly, you can register any SchemaRDD as a temporary table and query it with SQL, as in the sketch below.
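For example, a sketch that first applies an ordinary RDD transformation to topTweets and then registers it as a temporary table so it can be queried again with SQL; the table name top_tweets is illustrative:
// A sketch: topTweets is the SchemaRDD built in Example 9-9
// Ordinary RDD transformations work directly on the Row objects
val longTweets = topTweets.filter(row => row.getString(0).length > 100)
println(longTweets.count())
// Registering the SchemaRDD as a temporary table lets you query it with SQL
topTweets.registerTempTable("top_tweets")
val loudTweets = hiveCtx.sql("SELECT text FROM top_tweets WHERE retweetCount > 5")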