Database Reference
In-Depth Information
Luckily, there is a command-line tool called
in2csv
(Groskopf, 2014), which is able to
convert Microsoft Excel spreadsheets to CSV files. CSV stands for comma-separated
values or character-separated values. Working with CSV can be tricky because it lacks
a formal specification.
RFC 4180
defines the CSV format according to the following
three points:
1. Each record is located on a separate line, delimited by a line break (CRLF). For
example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
2. The last record in the file may or may not have an ending line break. For exam‐
ple:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
3. There may be an optional header line appearing as the first line of the file with
the same format as normal record lines. This header will contain names corre‐
sponding to the fields in the file and should contain the same number of fields as
the records in the rest of the file (the presence or absence of the header line
should be indicated via the optional
header
parameter of this MIME type). For
example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
Let's demonstrate
in2csv
using a spreadsheet that contains the top 250 movies from
the Internet Movie Database (IMDb). The file is named
imdb-250.xlsx
and can be
obtained from
http://bit.ly/analyzing_top250_movies_list
.
To extract its data, we
invoke
in2csv
as follows:
$
cd
~/book/ch03
$
in2csv data/imdb-250.xlsx > data/imdb-250.csv
The format of the file is automatically determined by the extension,
.xlsx
in this case.
If we were to pipe the data into
in2csv
, we would have to specify the format explic‐
itly. Let's look at the data:
$
in2csv data/imdb-250.xlsx | head | cut -c1-80
Title,title trim,Year,Rank,Rank (desc),Rating,New in 2011 from 2010?,2010 rank,R
Sherlock Jr. (1924),SherlockJr.(1924),1924,221,30,8,y,n/a,n/a,
The Passion of Joan of Arc (1928),ThePassionofJoanofArc(1928),1928,212,39,8,y,n/
His Girl Friday (1940),HisGirlFriday(1940),1940,250,1,8,y,n/a,n/a,
Tokyo Story (1953),TokyoStory(1953),1953,248,3,8,y,n/a,n/a,
The Man Who Shot Liberty Valance (1962),TheManWhoShotLibertyValance(1962),1962,2
Persona (1966),Persona(1966),1966,200,51,8,y,n/a,n/a,
Stalker (1979),Stalker(1979),1979,243,8,8,y,n/a,n/a,