Scrubbing Data
It is not uncommon that the obtained data has missing values, inconsistencies, errors,
weird characters, or uninteresting columns. In that case, you have to scrub, or clean,
the data before you can do anything interesting with it. Common scrubbing operations
include the following (a small example combining several of them appears after the list):
• Filtering lines
• Extracting certain columns
• Replacing values
• Extracting words
• Handling missing values
• Converting data from one format to another
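As a minimal sketch of how such operations compose on the command line (the file raw.csv and its layout are hypothetical), the following pipeline filters out comment lines, extracts the first and third columns, and replaces the placeholder "N/A" with an empty field:

$ grep -v '^#' raw.csv | cut -d , -f 1,3 | sed 's|N/A||g' > clean.csv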
While we data scientists love to create exciting data visualizations and insightful
models (steps 3 and 4), usually much effort goes into obtaining and scrubbing the
required data first (steps 1 and 2). In "Data Jujitsu," DJ Patil states that "80% of the
work in any data project is in cleaning the data" (2012). In Chapter 5, we demonstrate
how the command line can help accomplish such data scrubbing operations.
Exploring Data
Once you have scrubbed your data, you are ready to explore it. This is where it gets
interesting, because this is when you truly get to know your data. In Chapter 7, we show you
how the command line can be used to:
• Look at your data.
• Derive statistics from your data.
• Create interesting visualizations.
Command-line tools introduced in Chapter 7 include csvstat (Groskopf, 2014),
feedgnuplot (Kogan, 2014), and Rio (Janssens, 2014).
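To give a flavor of what these tools look like in practice (data.csv is a hypothetical CSV file), you might run:

$ head -n 5 data.csv                # look at the first few rows
$ csvstat data.csv                  # summary statistics for every column
$ seq 100 | feedgnuplot --lines     # quick line plot of a stream of numbers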
Modeling Data
If you want to explain the data or predict what will happen, you probably want to
create a statistical model of your data. Techniques to create a model include clustering,
classification, regression, and dimensionality reduction. The command line is not
suitable for implementing a new model from scratch. It is, however, very useful to be
able to build a model from the command line. In Chapter 9, we will introduce several
command-line tools that either build a model locally or employ an API to perform
the computation in the cloud.
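As one sketch of what building a model from the command line can look like, the following uses Rio (cited above), which reads a CSV file from standard input into an R data frame named df and evaluates an R expression passed with -e; the file data.csv and its columns x and y are hypothetical:

$ < data.csv Rio -e 'summary(lm(y ~ x, data = df))'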