Spark SQL/HiveQL type           Scala type    Java type    Python
STRUCT<COL1: COL1_TYPE, ...>    Row           Row          Row
The last type, structures, is simply represented as other Rows in Spark SQL. All of
these types can also be nested within each other; for example, you can have arrays of
structs, or maps that contain structs.
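To make the nesting concrete, here is a minimal sketch in plain Python that does not require Spark. It uses collections.namedtuple as a stand-in for Row. The User and Tweet types, their field names, and the sample values are hypothetical and are not taken from the chapter's dataset.

```python
from collections import namedtuple

# Hypothetical record types standing in for Spark SQL structs (Rows).
User = namedtuple("User", ["name", "followers"])
Tweet = namedtuple("Tweet", ["text", "author", "mentions", "by_role"])

tweet = Tweet(
    text="example tweet",
    author=User("alice", 100),                     # struct nested in a struct
    mentions=[User("bob", 20), User("carol", 5)],  # array of structs
    by_role={"editor": User("dave", 7)},           # map that contains structs
)

# Nested values are reached by chaining field, index, and key access.
author_name = tweet.author.name
first_mention = tweet.mentions[0].name
editor_followers = tweet.by_role["editor"].followers
```

Each level of nesting is unwrapped with the access style of its container: a field name for a struct, an index for an array, a key for a map.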
Working with Row objects
Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays
of fields. In Scala/Java, Row objects have a number of getter functions to obtain the
value of each field given its index. The standard getter, get (or apply in Scala), takes a
column number and returns an Object type (or Any in Scala) that we are responsible
for casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short,
and String, there is a getType() method, which returns that type. For example,
getString(0) would return field 0 as a string, as you can see in Examples 9-12 and 9-13.
Example 9-12. Accessing the text column (also first column) in the topTweets
SchemaRDD in Scala
val topTweetText = topTweets.map(row => row.getString(0))
Example 9-13. Accessing the text column (also first column) in the topTweets
SchemaRDD in Java
JavaRDD<String> topTweetText = topTweets.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
      return row.getString(0);
    }});
In Python, Row objects are a bit different since we don't have explicit typing. We just
access the ith element using row[i]. In addition, Python Rows support named access
to their fields, of the form row.column_name, as you can see in Example 9-14. If you
are uncertain of what the column names are, we illustrate printing the schema in
“JSON” on page 172.
Example 9-14. Accessing the text column in the topTweets SchemaRDD in Python
topTweetText = topTweets.map(lambda row: row.text)
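The two Python access styles can be tried without a cluster. The sketch below is a stand-in rather than Spark code: it models a row with collections.namedtuple, which, like PySpark's Row, is a tuple with named fields. The TweetRow name, the retweetCount field, and the sample values are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in for a PySpark Row; the field names are illustrative assumptions.
TweetRow = namedtuple("TweetRow", ["text", "retweetCount"])

rows = [TweetRow("first tweet", 12), TweetRow("second tweet", 3)]

# Positional access, as with row[i].
texts_by_index = [row[0] for row in rows]

# Named access, as with row.column_name.
texts_by_name = [row.text for row in rows]

# Listing the field names of the stand-in type.
field_names = TweetRow._fields
```

Both list comprehensions produce the same values. On a real SchemaRDD, the same row[0] or row.text expression would sit inside the lambda passed to map.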