CHAPTER 5
Scrubbing Data
In Chapter 3, we looked at the first step of the OSEMN model for data science: how
to obtain data from a variety of sources. It's not uncommon for this data to have
missing values, inconsistencies, errors, weird characters, or uninteresting columns.
Sometimes we only need a specific portion of the data, and sometimes we need the data
to be in a different format. In those cases, we have to clean, or scrub, the data
before we can move on to the third step: exploring data.
The data we obtained in Chapter 3 can come in a variety of formats. The most common
ones are plain text, CSV, JSON, and HTML/XML. Because most command-line tools operate
on one format only, it is worthwhile to be able to convert data from one format to
another.
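For instance, a small, made-up JSON array can be turned into CSV rows with jq, one of
the tools we'll discuss later in this chapter (the data itself is invented for
illustration):

$ echo '[{"name":"foo","count":2},{"name":"bar","count":1}]' |
> jq -r '.[] | [.name, .count] | @csv'
"foo",2
"bar",1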
CSV, which is the main format we're working with in this chapter, is actually not the
easiest format to work with. Many CSV data sets are broken or incompatible with
each other because there is no standard syntax, unlike XML and JSON.
Once our data is in the format we want it to be, we can apply common scrubbing
operations. These include filtering, replacing, and merging data. The command line is
especially well suited for these kinds of operations, as there exist many powerful
command-line tools that are optimized for handling large amounts of data. Tools that
we'll discuss in this chapter include classic ones such as cut (Ihnat, MacKenzie, &
Meyering, 2012) and sed (Fenlason, Lord, Pizzini, & Bonzini, 2012), and newer ones
such as jq (Dolan, 2014) and csvgrep (Groskopf, 2014).
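To give a flavor of these operations, here is a small, invented CSV file being
filtered with csvgrep and having a value replaced with sed (the filename, column
names, and values are made up for illustration):

$ cat data.csv
name,value
foo,1
bar,2
foo,3
$ < data.csv csvgrep -c name -m foo
name,value
foo,1
foo,3
$ < data.csv sed -e 's/foo/baz/'
name,value
baz,1
bar,2
baz,3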
The scrubbing tasks that we discuss in this chapter apply not only to the input data.
Sometimes we also need to reformat the output of other command-line tools. For
example, to transform the output of uniq -c into a CSV data set, we could use awk
(Brennan, 1994) and header.
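A minimal sketch of what that could look like is shown below; the input values are
made up, and we assume header's -a option adds the given column names as a header
row:

$ printf 'foo\nbar\nfoo\n' | sort | uniq -c
      1 bar
      2 foo
$ printf 'foo\nbar\nfoo\n' | sort | uniq -c |
> awk '{print $2","$1}' | header -a value,count
value,count
bar,1
foo,2

Here, awk swaps the count and the value and joins them with a comma, and header
prepends the column names.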