Exploring Data - Data Science at the Command Line

Overview

In this chapter, you'll learn how to:

• Inspect the data and its properties

• Compute descriptive statistics

• Create data visualizations inside and outside the command line

Inspecting Data and Its Properties

In this section, we'll demonstrate how to inspect a data set and its properties. Because

the upcoming visualization and modeling techniques expect the data to be in tabular

form, we'll assume that the data is in CSV format. You can use the techniques

described in Chapter 5 to convert your data to CSV if necessary.

For simplicity's sake, we'll also assume that your data has a header. In the first

subsection, we are going to determine whether that is the case. Once we know we have the

data in place, we can continue answering the following questions:

• How many data points and features does the data set have?

• What does the raw data look like?

• What kind of features does the data set have?

• Can some of these features be treated as categorical or as factors?
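
As a first taste of the opening question, the sketch below counts data points and features using only standard tools. It assumes a header line and no quoted commas or embedded newlines in any field; people.csv and its contents are hypothetical stand-ins for your own data.

```shell
# people.csv is a hypothetical example file
printf 'name,age,city\nAlice,34,Utrecht\nBob,27,Delft\n' > people.csv

# Data points: total number of lines minus the header line
rows=$(( $(wc -l < people.csv) - 1 ))

# Features: number of comma-separated fields in the header line
cols=$(( $(head -n 1 people.csv | tr ',' '\n' | wc -l) ))

echo "data points: $rows, features: $cols"
```

For files with quoted fields, a CSV-aware tool is the safer choice, because a plain comma split would overcount.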

Header or Not, Here I Come

You can check whether your file has a header by printing the first few lines:

$ head file.csv | csvlook

It's then up to you to decide whether the first line is indeed a header or already the

first data point. When the data set contains no header or when its header contains

newlines, you're best off going back and correcting that by scrubbing the data (refer

to Chapter 5 for information on how to do that).
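
If csvlook is not available, the same check can be approximated with standard tools. The following is a minimal sketch, not the only way to do it; sample.csv and its contents are hypothetical stand-ins for your own file.

```shell
# sample.csv is a hypothetical stand-in for your own file
printf 'name,age,city\nAlice,34,Utrecht\nBob,27,Delft\n' > sample.csv

# The first line is the candidate header
header=$(head -n 1 sample.csv)

# The second line is the first data point if line 1 really is a header
first_row=$(sed -n '2p' sample.csv)

echo "line 1: $header"
echo "line 2: $first_row"
```

When line 1 holds names and line 2 holds values, the file most likely has a header; the final judgment is still yours.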

Inspect All the Data

If you want to inspect the raw data, then it's best not to use the cat command-line

tool, as cat prints all the data to the screen in one go. In order to inspect the raw data

at your own pace, we recommend using less (Nudelman, 2013) with the -S option:

$ less -S file.csv
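
The -S option tells less to chop long lines instead of wrapping them, which only matters when a line is wider than your terminal. The sketch below measures the widest line first and opens the pager only when the script is attached to a terminal; wide.csv and its contents are hypothetical.

```shell
# wide.csv is a hypothetical example file with one long line
printf 'id,comment\n1,short\n2,a much longer comment that would wrap on a narrow terminal\n' > wide.csv

# Length, in characters, of the longest line in the file
max_width=$(awk '{ if (length($0) > max) max = length($0) } END { print max }' wide.csv)
echo "widest line: $max_width characters"

# Page through the file without wrapping, but only in an interactive session
if [ -t 1 ]; then
  less -S wide.csv
fi
```

Inside less, the left and right arrow keys scroll horizontally and q quits.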
