Handling incorrectly formatted records can be a big problem, especially with semistructured data like JSON. With small datasets it can be acceptable to stop the world (i.e., fail the program) on malformed input, but often with large datasets malformed input is simply a part of life. If you do choose to skip incorrectly formatted data, you may wish to look at using accumulators to keep track of the number of errors.
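The skip-and-count pattern can be sketched locally. In this sketch the list comprehension stands in for a flatMap over the RDD, and the plain bad_records counter stands in for an accumulator created with sc.accumulator(0); in a real Spark job a driver-side global would not see updates made on the workers, which is exactly why accumulators exist. The record strings are invented for illustration.

```python
import json

# Hypothetical input: one JSON record per line, one of them malformed.
lines = [
    '{"name": "Sparky", "lovesPandas": true}',
    'not json at all',
    '{"name": "Holden", "lovesPandas": false}',
]

bad_records = 0  # in Spark: errors = sc.accumulator(0)

def safe_parse(line):
    """Return a one-element list on success, an empty list on failure,
    so a flatMap-style expansion silently drops bad records."""
    global bad_records
    try:
        return [json.loads(line)]
    except ValueError:
        bad_records += 1  # in Spark: errors.add(1)
        return []

# Locally this comprehension plays the role of input.flatMap(safe_parse).
parsed = [record for line in lines for record in safe_parse(line)]
```

After the job runs, the driver can inspect the accumulator's value and decide whether the error rate is acceptable before using the output.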
Saving JSON
Writing out JSON files is much simpler than loading them, because we don't have to worry about incorrectly formatted data, and we know the type of the data that we are writing out. We can use the same libraries we used to convert an RDD of strings into parsed JSON data: here we take our RDD of structured data, convert it into an RDD of strings, and write it out using Spark's text file API. Let's say we were running a promotion for people who love pandas. We can take our input from the first step and filter it for the people who love pandas, as shown in Examples 5-9 through 5-11.
Example 5-9. Saving JSON in Python

(data.filter(lambda x: x['lovesPandas'])
     .map(lambda x: json.dumps(x))
     .saveAsTextFile(outputFile))
Example 5-10. Saving JSON in Scala

result.filter(p => p.lovesPandas)
      .map(mapper.writeValueAsString(_))
      .saveAsTextFile(outputFile)
Example 5-11. Saving JSON in Java

class WriteJson implements FlatMapFunction<Iterator<Person>, String> {
  public Iterable<String> call(Iterator<Person> people) throws Exception {
    ArrayList<String> text = new ArrayList<String>();
    ObjectMapper mapper = new ObjectMapper();
    while (people.hasNext()) {
      Person person = people.next();
      text.add(mapper.writeValueAsString(person));
    }
    return text;
  }
}

JavaRDD<Person> result = input.mapPartitions(new ParseJson())
  .filter(new LikesPandas());
JavaRDD<String> formatted = result.mapPartitions(new WriteJson());
formatted.saveAsTextFile(outfile);