Scrubbing Data
It is not uncommon that the obtained data has missing values, inconsistencies, errors,
weird characters, or uninteresting columns. In that case, you have to scrub, or clean,
the data before you can do anything interesting with it. Common scrubbing operations
include the following (a small example combining several of them appears after the list):
• Filtering lines
• Extracting certain columns
• Replacing values
• Extracting words
• Handling missing values
• Converting data from one format to another
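As a minimal sketch of how such operations compose on the command line (the file raw.csv and its layout are hypothetical), the following pipeline filters out comment lines, extracts the first and third columns, and replaces the placeholder "N/A" with an empty field:

$ grep -v '^#' raw.csv | cut -d , -f 1,3 | sed 's|N/A||g' > clean.csv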
While we data scientists love to create exciting data visualizations and insightful
models (steps 3 and 4), usually much effort goes into obtaining and scrubbing the
required data first (steps 1 and 2). In "Data Jujitsu," DJ Patil states that "80% of the
work in any data project is in cleaning the data" (2012). In Chapter 5, we demonstrate
how the command line can help accomplish such data scrubbing operations.
Exploring Data
Once you have scrubbed your data, you are ready to explore it. This is where it gets
interesting, because this is when you truly get to know your data. In Chapter 7, we show you
how the command line can be used to:
• Look at your data.
• Derive statistics from your data.
• Create interesting visualizations.
Command-line tools introduced in Chapter 7 include csvstat (Groskopf, 2014),
feedgnuplot (Kogan, 2014), and Rio (Janssens, 2014).
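To give a flavor of what these tools look like in practice (data.csv is a hypothetical CSV file), you might run:

$ head -n 5 data.csv                # look at the first few rows
$ csvstat data.csv                  # summary statistics for every column
$ seq 100 | feedgnuplot --lines     # quick line plot of a stream of numbers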
Modeling Data
If you want to explain the data or predict what will happen, you probably want to
create a statistical model of your data. Techniques to create a model include clustering,
classification, regression, and dimensionality reduction. The command line is not
suitable for implementing a new model from scratch. It is, however, very useful to be
able to build a model from the command line. In Chapter 9, we will introduce several
command-line tools that either build a model locally or employ an API to perform
the computation in the cloud.
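As one sketch of what building a model from the command line can look like, the following uses Rio (cited above), which reads a CSV file from standard input into an R data frame named df and evaluates an R expression passed with -e; the file data.csv and its columns x and y are hypothetical:

$ < data.csv Rio -e 'summary(lm(y ~ x, data = df))'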