We can thus easily load and save JSON data with Spark by using the existing
mechanism for working with text and adding JSON libraries.
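As a rough sketch of that idea in Python, the two functions below convert between lines of text and records using the standard json module. The sample lines and the lovesPandas field are hypothetical, and a plain list stands in for the RDD so the snippet runs without a cluster.

```python
import json

def parse_json_line(line):
    """Parse one line of text that holds a single JSON object."""
    return json.loads(line)

def to_json_line(record):
    """Serialize one record back to a single line of JSON text."""
    return json.dumps(record, sort_keys=True)

# Stand-in for the lines a text RDD would contain (hypothetical data).
lines = ['{"name": "Holden", "lovesPandas": true}',
         '{"name": "Sparky", "lovesPandas": false}']

records = [parse_json_line(l) for l in lines]
panda_lovers = [r for r in records if r["lovesPandas"]]
output_lines = [to_json_line(r) for r in panda_lovers]

# With an existing SparkContext `sc`, the same functions would plug in as:
#   sc.textFile(inputFile).map(parse_json_line) \
#     .filter(lambda r: r["lovesPandas"]) \
#     .map(to_json_line).saveAsTextFile(outputFile)
```

Keeping the parse and serialize steps as ordinary functions means they can be checked on a handful of strings before being handed to map().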
Comma-Separated Values and Tab-Separated Values
Comma-separated value (CSV) files are supposed to contain a fixed number of fields
per line, with the fields separated by commas (or by tabs in the case of tab-separated
value, or TSV, files). Records are often stored one per line, but this is not always the
case, as a record can sometimes span lines. CSV and TSV files can be inconsistent,
most frequently in how they handle newlines, escaping, and the rendering of
non-ASCII characters or noninteger numbers. CSVs cannot handle nested
field types natively, so we have to unpack and pack to specific fields manually.
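To make the packing and unpacking concrete, here is a minimal sketch that flattens a nested record into one CSV row and rebuilds it afterward. The record layout, the column order, and the choice of "|" to join list items are all assumptions made for illustration; a value that itself contains "|" would need a different scheme.

```python
import csv
import io

# Hypothetical record with a nested list and a nested dict.
record = {"name": "Holden",
          "pets": ["panda", "cat"],
          "address": {"city": "SF", "zip": "94105"}}

def pack(rec):
    """Flatten a nested record into a flat row of strings."""
    return [rec["name"],
            "|".join(rec["pets"]),      # list -> one delimited field
            rec["address"]["city"],     # dict -> one column per key
            rec["address"]["zip"]]

def unpack(row):
    """Rebuild the nested record from a flat row."""
    name, pets, city, zip_code = row
    return {"name": name,
            "pets": pets.split("|") if pets else [],
            "address": {"city": city, "zip": zip_code}}

# Write the packed row as CSV text, then read it back.
out = io.StringIO()
csv.writer(out).writerow(pack(record))
line = out.getvalue().strip()

row = next(csv.reader([line]))
restored = unpack(row)
```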
Unlike with JSON fields, each record doesn't have field names associated with it;
instead we get back row numbers. It is common practice in single CSV files to make
the first row's column values the names of each field.
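A small sketch of the header-row convention, using only the Python csv module on an in-memory list of lines (the data is hypothetical): the first row is read as the field names, and every following row is zipped against it.

```python
import csv

# Stand-in for the lines of a single CSV file whose first row is a header.
lines = ["name,favouriteAnimal",
         "Holden,panda",
         "Sparky,bear"]

reader = csv.reader(lines)
header = next(reader)                      # first row supplies the field names
records = [dict(zip(header, row)) for row in reader]
```

csv.DictReader behaves the same way when no fieldnames argument is given. Note that this relies on seeing the file from its first line, which is why the convention applies to single files rather than to arbitrary chunks of a larger dataset.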
Loading CSV
Loading CSV/TSV data is similar to loading JSON data in that we can first load it as
text and then process it. The lack of standardization of format leads to different
versions of the same library sometimes handling input in different ways.
As with JSON, there are many different CSV libraries, but we will use only one for
each language. Once again, in Python we use the included csv library. In both Scala
and Java we use opencsv.
There is also a Hadoop InputFormat, CSVInputFormat, that we can
use to load CSV data in Scala and Java, although it does not
support records containing newlines.
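To see why embedded newlines matter, consider the following sketch (hypothetical data, plain Python, no Spark). Splitting the text into lines first cuts the quoted field in two, while handing the whole text to the csv module keeps the record intact.

```python
import csv
import io

# One field in the second record contains a newline inside quotes.
data = 'name,note\nHolden,"likes\npandas"\nSparky,bears\n'

# Splitting on line boundaries first breaks the quoted field apart.
naive_lines = data.splitlines()

# Parsing the whole text at once lets the csv module honor the quotes.
rows = list(csv.reader(io.StringIO(data)))
```

Here naive_lines has four entries for what are really three records, whereas rows has three entries and the second one keeps its newline. Any loader that hands the parser one physical line at a time runs into the first behavior.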
If your CSV data happens to not contain newlines in any of the fields, you can load
your data with textFile() and parse it, as shown in Examples 5-12 through 5-14.
Example 5-12. Loading CSV with textFile() in Python
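import csv
import io  # Python 3; the Python 2 StringIO module no longer exists
...
def loadRecord(line):
    """Parse a single CSV line into a dict keyed by field name."""
    input = io.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    # Each line holds exactly one record, so take the first parsed row
    return next(reader)

# sc is an existing SparkContext; inputFile is the path to the CSV data
input = sc.textFile(inputFile).map(loadRecord)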
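For TSV data, a variant of the loadRecord function above might look like the following sketch, written for Python 3. The function name load_tsv_record is our own, and the only substantive change is the delimiter argument; the call at the end runs it on a single hypothetical line so it can be checked without Spark.

```python
import csv
import io

def load_tsv_record(line):
    """Parse a single tab-separated line into a dict keyed by field name."""
    reader = csv.DictReader(io.StringIO(line),
                            fieldnames=["name", "favouriteAnimal"],
                            delimiter="\t")
    return next(reader)

# Try the parser on one line before mapping it over an RDD.
record = load_tsv_record("Holden\tpanda")

# With an existing SparkContext `sc`:
#   sc.textFile(inputFile).map(load_tsv_record)
```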