Exploring Data - Data Science at the Command Line

CHAPTER 7

Exploring Data

Now that we have obtained and scrubbed our data, we can continue with the third

step of the OSEMN model, which is to explore it. After all that hard work (unless you

already had clean data lying around!), it's time for some fun.

Exploring is the step where you familiarize yourself with the data. Being familiar with

the data is essential when you want to extract any value from it. For example,

knowing what kind of features the data has means you know which ones are worth further

exploring and which ones you can use to answer any questions that you have.

Exploring your data can be done from three perspectives. The first perspective is to

inspect the data and its properties. Here, we want to know, for example, what the raw

data looks like, how many data points the data set has, and what kind of features the

data set has.
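
As a rough sketch of this first perspective, the three questions above can be answered with standard Unix tools. The sample file, its column names, and its values are made up for illustration; this is one possible approach, not necessarily the one used later in the chapter.

```shell
# Hypothetical sample data: the columns and values are invented for this sketch.
data=$(mktemp)
cat > "$data" <<'EOF'
name,age,city
alice,34,Utrecht
bob,27,Delft
carol,45,Leiden
EOF

# What does the raw data look like? Print the first few lines.
head -n 3 "$data"

# How many data points are there? Count the lines after the header.
n_rows=$(tail -n +2 "$data" | wc -l | tr -d ' ')
echo "data points: $n_rows"

# What features does the data set have? Split the header on commas.
features=$(head -n 1 "$data" | tr ',' '\n')
echo "$features"
n_features=$(echo "$features" | wc -l | tr -d ' ')
echo "features: $n_features"

# Clean up the temporary file.
rm -f "$data"
```

Splitting the header on commas assumes a simple CSV file without quoted fields; a dedicated CSV tool would be the safer choice for real data.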

The second perspective from which we can explore our data is to compute descriptive

statistics. This perspective is useful for learning more about the individual features.

One advantage of this perspective is that the output is often brief and textual and can

therefore be printed on the command line.
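
To illustrate how brief such output can be, here is a minimal sketch that summarizes a single numeric feature with awk. The values are hypothetical, and the choice of awk is an assumption made for this example rather than the tool the chapter prescribes.

```shell
# Hypothetical numeric feature, one value per line.
values=$(printf '%s\n' 34 27 45 30)

# Track the count, minimum, maximum, and running sum in a single pass.
stats=$(echo "$values" | awk '
  NR == 1 { min = $1; max = $1 }
  { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
  END { printf "n=%d min=%d max=%d mean=%.2f", NR, min, max, sum / NR }
')
echo "$stats"
```

For these four values the script prints `n=4 min=27 max=45 mean=34.00`, a single line that fits comfortably in a terminal.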

The third perspective is to create visualizations of the data. From this perspective, we

can gain insight into how multiple features interact. We'll discuss a way of creating

visualizations that can be printed on the command line. However, most visualizations

are best displayed on graphical user interfaces. An advantage of visualizations over

descriptive statistics is that visualizations are more flexible and can convey much

more information.
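
As a small taste of a visualization that can be printed on the command line, the following sketch draws a text bar chart of how often each value occurs. The input values are hypothetical, and this is only one simple way to do it, not necessarily the approach discussed later in the chapter.

```shell
# Hypothetical observations, one value per line.
values=$(printf '%s\n' 1 2 2 3 3 3)

# Count the occurrences of each value, then draw one '#' per occurrence.
chart=$(echo "$values" | sort -n | uniq -c | awk '{
  bar = ""
  for (i = 0; i < $1; i++) bar = bar "#"
  printf "%s %s\n", $2, bar
}')
echo "$chart"
```

Each output line shows a value followed by a bar whose length equals its count, so the shape of the distribution is visible at a glance.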
