Example 9-16. Hive load in Scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))
Example 9-17. Hive load in Java
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT key, value FROM mytable");
JavaRDD<Integer> keys = rows.toJavaRDD().map(new Function<Row, Integer>() {
  public Integer call(Row row) { return row.getInt(0); }
});
Parquet
Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. It is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL. Spark SQL provides methods for reading and writing Parquet files directly.
First, to load data, you can use HiveContext.parquetFile or SQLContext.parquetFile, as shown in Example 9-18.
Example 9-18. Parquet load in Python
# Load some data in from a Parquet file with fields name and favouriteAnimal
rows = hiveCtx.parquetFile(parquetFile)
names = rows.map(lambda row: row.name)
print "Everyone"
print names.collect()
You can also register a Parquet file as a Spark SQL temp table and write queries against it. Example 9-19 continues from Example 9-18, where we loaded the data.
Example 9-19. Parquet query in Python
# Find the panda lovers
rows.registerTempTable("people")
pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE favouriteAnimal = \"panda\"")
print "Panda friends"
print pandaFriends.map(lambda row: row.name).collect()
Finally, you can save the contents of a SchemaRDD to Parquet with saveAsParquetFile(), as shown in Example 9-20.
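Continuing from Example 9-19, a minimal version of this save looks like the following; the HDFS output path here is a placeholder you would replace with a real location.

Example 9-20. Parquet file save in Python
# Write the panda lovers back out as a Parquet file
# (the path below is a placeholder, not a real location)
pandaFriends.saveAsParquetFile("hdfs://...")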