Handling incorrectly formatted records can be a big problem, especially with semistructured data like JSON. With small datasets it can be acceptable to stop the world (i.e., fail the program) on malformed input, but often with large datasets malformed input is simply a part of life. If you do choose to skip incorrectly formatted data, you may wish to look at using accumulators to keep track of the number of errors.
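The skip-and-count pattern can be sketched locally. In this sketch the list comprehension stands in for a flatMap over the RDD, and the plain bad_records counter stands in for an accumulator created with sc.accumulator(0); in a real Spark job a driver-side global would not see updates made on the workers, which is exactly why accumulators exist. The record strings are invented for illustration.

```python
import json

# Hypothetical input: one JSON record per line, one of them malformed.
lines = [
    '{"name": "Sparky", "lovesPandas": true}',
    'not json at all',
    '{"name": "Holden", "lovesPandas": false}',
]

bad_records = 0  # in Spark: errors = sc.accumulator(0)

def safe_parse(line):
    """Return a one-element list on success, an empty list on failure,
    so a flatMap-style expansion silently drops bad records."""
    global bad_records
    try:
        return [json.loads(line)]
    except ValueError:
        bad_records += 1  # in Spark: errors.add(1)
        return []

# Locally this comprehension plays the role of input.flatMap(safe_parse).
parsed = [record for line in lines for record in safe_parse(line)]
```

After the job runs, the driver can inspect the accumulator's value and decide whether the error rate is acceptable before using the output.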
Saving JSON
Writing out JSON files is much simpler than loading them, because we don't have to worry about incorrectly formatted data, and we know the type of the data that we are writing out. We can use the same libraries we used to convert an RDD of strings into parsed JSON data: here we take our RDD of structured data, convert it into an RDD of strings, and write it out using Spark's text file API. Let's say we were running a promotion for people who love pandas. We can take our input from the first step and filter it for the people who love pandas, as shown in Examples 5-9 through 5-11.
Example 5-9. Saving JSON in Python

(data.filter(lambda x: x['lovesPandas'])
     .map(lambda x: json.dumps(x))
     .saveAsTextFile(outputFile))
Example 5-10. Saving JSON in Scala

result.filter(p => p.lovesPandas)
      .map(mapper.writeValueAsString(_))
      .saveAsTextFile(outputFile)
Example 5-11. Saving JSON in Java

class WriteJson implements FlatMapFunction<Iterator<Person>, String> {
  public Iterable<String> call(Iterator<Person> people) throws Exception {
    ArrayList<String> text = new ArrayList<String>();
    ObjectMapper mapper = new ObjectMapper();
    while (people.hasNext()) {
      Person person = people.next();
      text.add(mapper.writeValueAsString(person));
    }
    return text;
  }
}

JavaRDD<Person> result = input.mapPartitions(new ParseJson())
  .filter(new LikesPandas());
JavaRDD<String> formatted = result.mapPartitions(new WriteJson());
formatted.saveAsTextFile(outfile);