Modeling Data - Data Science at the Command Line

CHAPTER 9

Modeling Data

In this chapter, we'll perform the fourth step of the OSEMN model (and the last step to require a computer): modeling data. Generally speaking, to model data is to create an abstract or higher-level description of your data. As with creating visualizations, modeling means taking a step back from the individual data points.

Visualizations, on the one hand, are characterized by shapes, positions, and colors such that we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)

In this chapter, we'll consider four common types of algorithms to model data:

• Dimensionality reduction
• Clustering
• Regression
• Classification

These four types of algorithms come from the field of machine learning. As such, we're going to change our vocabulary a bit. Let's assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity, we assume that each column that contains numerical values is an input feature. If a data point also contains a nonnumerical field, such as the species column in the Iris data set, then that is known as the data point's label.
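To make this vocabulary concrete, here is a minimal Python sketch that splits a small CSV into features and labels using the simplification above: a column counts as a feature if every value in it parses as a number, and any remaining column is treated as a label. The three-row sample, the column names, and the helper names (`is_number`, `split_features_and_labels`) are assumptions made for illustration; they do not come from the chapter, and the sketch assumes a nonempty file with a header row.

```python
import csv
import io

# Hypothetical sample: a few rows in the style of the Iris data set.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_features_and_labels(csv_text):
    """Split a CSV data set into numerical features and nonnumerical labels."""
    # Each row after the header is one data point.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    # Simplifying assumption: a column is a feature if all its values are numeric.
    feature_columns = [c for c in columns if all(is_number(r[c]) for r in rows)]
    label_columns = [c for c in columns if c not in feature_columns]
    features = [[float(r[c]) for c in feature_columns] for r in rows]
    labels = [[r[c] for c in label_columns] for r in rows]
    return feature_columns, label_columns, features, labels


feature_columns, label_columns, features, labels = split_features_and_labels(SAMPLE_CSV)
print("features:", feature_columns)
print("labels:  ", label_columns)
print("first data point:", features[0], labels[0])
```

In this sample, the four measurement columns end up as features and the species column as the label. A real data set may need a more careful rule, for example for numerical identifiers or missing values.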

The first two types of algorithms (dimensionality reduction and clustering) are most often unsupervised, which means that they create a model based on the features of the data set only. The last two types of algorithms (regression and classification) are
