val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))
Example 9-17. Hive load in Java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.hive.HiveContext;

HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT key, value FROM mytable");
JavaRDD<Integer> keys = rows.toJavaRDD().map(new Function<Row, Integer>() {
  public Integer call(Row row) {
    return row.getInt(0);
  }
});
Parquet
Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. It is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL. Spark SQL provides methods for reading and writing data directly to and from Parquet files.
First, to load data, you can use HiveContext.parquetFile or SQLContext.parquetFile, as shown in Example 9-18.
Example 9-18. Parquet load in Python
# Load some data in from a Parquet file with fields name and favouriteAnimal
rows = hiveCtx.parquetFile(parquetFile)
names = rows.map(lambda row: row.name)
print "Everyone"
print names.collect()
You can also register a Parquet file as a Spark SQL temp table and write queries against it. Example 9-19 continues from Example 9-18, where we loaded the data.
Example 9-19. Parquet query in Python
# Find the panda lovers
tbl = rows.registerTempTable("people")
pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE favouriteAnimal = \"panda\"")
print "Panda friends"
print pandaFriends.map(lambda row: row.name).collect()
Finally, you can save the contents of a SchemaRDD to Parquet with saveAsParquetFile(), as shown in Example 9-20.
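As a minimal sketch of that save call (assuming the pandaFriends SchemaRDD from Example 9-19 is still in scope and a live HiveContext is available; the output path here is only a placeholder):

```python
# Sketch only: requires a running SparkContext/HiveContext.
# The output path is a placeholder, not a real location.
pandaFriends.saveAsParquetFile("hdfs://...")
```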