val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))
Example 9-17. Hive load in Java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.hive.HiveContext;

HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT key, value FROM mytable");
JavaRDD<Integer> keys = rows.toJavaRDD().map(new Function<Row, Integer>() {
  public Integer call(Row row) {
    return row.getInt(0);
  }
});
Parquet
Parquet is a popular column-oriented storage format that can store records with nested fields efficiently. It is often used with tools in the Hadoop ecosystem, and it supports all of the data types in Spark SQL. Spark SQL provides methods for reading and writing data directly to and from Parquet files.
First, to load data, you can use HiveContext.parquetFile or SQLContext.parquetFile, as shown in Example 9-18.
Example 9-18. Parquet load in Python
# Load some data in from a Parquet file with fields name and favouriteAnimal
rows = hiveCtx.parquetFile(parquetFile)
names = rows.map(lambda row: row.name)
print "Everyone"
print names.collect()
You can also register a Parquet file as a Spark SQL temp table and write queries against it. Example 9-19 continues from Example 9-18, where we loaded the data.
Example 9-19. Parquet query in Python
# Find the panda lovers
tbl = rows.registerTempTable("people")
pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE favouriteAnimal = \"panda\"")
print "Panda friends"
print pandaFriends.map(lambda row: row.name).collect()
Finally, you can save the contents of a SchemaRDD to Parquet with saveAsParquetFile(), as shown in Example 9-20.
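As a minimal sketch of that save call (assuming the pandaFriends SchemaRDD from Example 9-19 is still in scope and a live HiveContext is available; the output path here is only a placeholder):

```python
# Sketch only: requires a running SparkContext/HiveContext.
# The output path is a placeholder, not a real location.
pandaFriends.saveAsParquetFile("hdfs://...")
```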