Spark SQL - Learning Spark


Spark SQL/HiveQL type           Scala type    Java type    Python
STRUCT<COL1: COL1_TYPE, ...>    Row           Row          Row

The last type, the structure, is simply represented as another Row in Spark SQL. All of these types can also be nested within each other; for example, you can have arrays of structs, or maps that contain structs.
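To make the nesting concrete, here is a minimal sketch in plain Python. It does not use Spark: `collections.namedtuple` stands in for Row, and the record types `User` and `Tweet`, their field names, and the sample values are all hypothetical, chosen only to show a struct inside a struct, an array of structs, and a map whose values are structs.

```python
from collections import namedtuple

# Hypothetical stand-ins for Rows; these names are illustrative, not Spark API.
User = namedtuple("User", ["name", "followers"])
Tweet = namedtuple("Tweet", ["text", "user", "mentions", "by_handle"])

tweet = Tweet(
    text="nested types",
    user=User("alice", 100),                        # struct nested in a struct
    mentions=[User("bob", 20), User("carol", 30)],  # array of structs
    by_handle={"alice": User("alice", 100)},        # map whose values are structs
)

# Reaching into the nested values combines field, index, and key access.
mention_names = [m.name for m in tweet.mentions]
print(tweet.user.name)
print(mention_names)
print(tweet.by_handle["alice"].followers)
```

In an actual SchemaRDD the nested values would be Row objects, lists, and dictionaries rather than namedtuples, but the access pattern has the same shape.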

Working with Row objects

Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields. In Scala/Java, Row objects have a number of getter functions to obtain the value of each field given its index. The standard getter, get (or apply in Scala), takes a column number and returns an Object type (or Any in Scala) that we are responsible for casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short, and String, there is a getType() method (getBoolean(), getInt(), getString(), and so on), which returns that type. For example, getString(0) would return field 0 as a string, as you can see in Examples 9-12 and 9-13.

Example 9-12. Accessing the text column (also first column) in the topTweets SchemaRDD in Scala

val topTweetText = topTweets.map(row => row.getString(0))

Example 9-13. Accessing the text column (also first column) in the topTweets SchemaRDD in Java

JavaRDD<String> topTweetText = topTweets.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getString(0);
    }
});

In Python, Row objects are a bit different since we don't have explicit typing. We just access the ith element using row[i]. In addition, Python Rows support named access to their fields, of the form row.column_name, as you can see in Example 9-14. If you are uncertain of what the column names are, we illustrate printing the schema in “JSON” on page 172.

Example 9-14. Accessing the text column in the topTweets SchemaRDD in Python

topTweetText = topTweets.map(lambda row: row.text)
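The two access styles can be compared side by side without a running Spark cluster. The sketch below is an illustration only: it uses `collections.namedtuple` as a stand-in for a Python Row (both are tuples with named fields), and the `retweetCount` column and the sample rows are assumptions made for the example, not data from the text.

```python
from collections import namedtuple

# Stand-in for a Row whose first column is `text`, as in topTweets.
# `retweetCount` and the sample values are hypothetical.
TweetRow = namedtuple("TweetRow", ["text", "retweetCount"])

rows = [TweetRow("hello spark", 3), TweetRow("rows are tuples", 7)]

by_index = [row[0] for row in rows]   # positional access: row[i]
by_name = [row.text for row in rows]  # named access: row.column_name

# Both styles read the same field, so the results match.
print(by_index == by_name)

# A namedtuple exposes its field names directly; for a real SchemaRDD,
# print the schema instead to discover the column names.
column_names = TweetRow._fields
print(column_names)
```

Named access tends to be the more robust choice, since it keeps working if the column order of the query changes.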
