Database Reference
In-Depth Information
Performing SQL Queries on CSV
In case the command-line tools mentioned in this chapter do not provide enough
flexibility, then there's another approach to scrub your data using the command line.
The command-line tool csvsql (Groskopf, 2014) allows you to execute SQL queries
directly on CSV files. As you may know, SQL is a very powerful language to define
operations for scrubbing data; it's also a very different way than using individual
command-line tools.
If your data originally comes from a relational database, then, if
possible, try to execute SQL queries on that database and subse‐
quently extract the data as CSV. As discussed in Chapter 3 , you can
use the command-line tool sql2csv for this. When you first export
data from the database to a CSV file, and then apply SQL, it is not
only slower, but there is also a possibility that the column types are
not correctly inferred from the CSV data.
In the scrubbing tasks below, we'll include several solutions that involve csvsql . The
basic command is this:
$ seq 5 | header -a value | csvsql --query "SELECT SUM(value) AS sum FROM stdin"
sum
15
If you pass standard input to csvsql , then the table is named stdin . The types of the
column are automatically inferred from the data. As you'll see later, in the combining
CSV files section, you can also specify multiple CSV files. Keep in mind that csvsql
employs the SQLite dialect. While SQL is generally more verbose than the other solu‐
tions, it's also much more flexible. If you already know how to tackle a scrubbing
problem with SQL, then there's no shame in using it from the command line!
Working with HTML/XML and JSON
As we saw in Chapter 3 , our obtained data can come in a variety of formats. The most
common ones are plain text, CSV, JSON, and HTML/XML. In this section, we're
going to demonstrate a couple of command-line tools that can convert our data from
one format to another. There are two reasons to convert data.
First, oftentimes the data needs to be in tabular form, just like a database table or a
spreadsheet, because many visualization and machine-learning algorithms depend on
it. CSV is inherently in tabular form, but JSON and HTML/XML data can have a
deeply nested structure.
Second, many command-line tools, especially the classic ones such as cut and grep ,
operate on plain text. This is because text is regarded as a universal interface between
Search WWH ::




Custom Search