CHAPTER 5
Scrubbing Data
In Chapter 3, we looked at the first step of the OSEMN model for data science: how
to obtain data from a variety of sources. It's not uncommon for this data to have
missing values, inconsistencies, errors, weird characters, or uninteresting columns.
Sometimes we only need a specific portion of the data, and sometimes we need the data
to be in a different format. In those cases, we have to clean, or scrub, the data
before we can move on to the third step: exploring data.
The data we obtained in Chapter 3 can come in a variety of formats. The most common
ones are plain text, CSV, JSON, and HTML/XML. Because most command-line tools operate
on one format only, it is worthwhile to be able to convert data from one format to
another.
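For instance, a small, made-up JSON array can be turned into CSV rows with jq, one of
the tools we'll discuss later in this chapter (the data itself is invented for
illustration):

$ echo '[{"name":"foo","count":2},{"name":"bar","count":1}]' |
> jq -r '.[] | [.name, .count] | @csv'
"foo",2
"bar",1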
CSV, which is the main format we're working with in this chapter, is actually not the
easiest format to work with. Many CSV data sets are broken or incompatible with
each other because there is no standard syntax, unlike XML and JSON.
Once our data is in the format we want it to be, we can apply common scrubbing
operations. These include filtering, replacing, and merging data. The command line is
especially well suited for these kinds of operations, as there exist many powerful
command-line tools that are optimized for handling large amounts of data. Tools that
we'll discuss in this chapter include classic ones such as cut (Ihnat, MacKenzie, &
Meyering, 2012) and sed (Fenlason, Lord, Pizzini, & Bonzini, 2012), and newer ones
such as jq (Dolan, 2014) and csvgrep (Groskopf, 2014).
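To give a flavor of these operations, here is a small, invented CSV file being
filtered with csvgrep and having a value replaced with sed (the filename, column
names, and values are made up for illustration):

$ cat data.csv
name,value
foo,1
bar,2
foo,3
$ < data.csv csvgrep -c name -m foo
name,value
foo,1
foo,3
$ < data.csv sed -e 's/foo/baz/'
name,value
baz,1
bar,2
baz,3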
The scrubbing tasks that we discuss in this chapter apply not only to the input data.
Sometimes we also need to reformat the output of other command-line tools. For
example, to transform the output of uniq -c into a CSV data set, we could use awk
(Brennan, 1994) and header.
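A minimal sketch of what that could look like is shown below; the input values are
made up, and we assume header's -a option adds the given column names as a header
row:

$ printf 'foo\nbar\nfoo\n' | sort | uniq -c
      1 bar
      2 foo
$ printf 'foo\nbar\nfoo\n' | sort | uniq -c |
> awk '{print $2","$1}' | header -a value,count
value,count
bar,1
foo,2

Here, awk swaps the count and the value and joins them with a comma, and header
prepends the column names.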