3.2 Exploratory Data Analysis
So far, this chapter has addressed importing and exporting data in R, basic data
types and operations, and generating descriptive statistics. Functions such as
summary() can help analysts easily get an idea of the magnitude and range of
the data, but other aspects such as linear relationships and distributions are more
difficult to see from descriptive statistics. For example, the following code shows a
summary view of a data frame, data, with two columns, x and y. The output shows
the range of x and y, but it does not reveal what relationship, if any, exists
between the two variables.
    summary(data)
           x                  y
     Min.   :-1.90483   Min.   :-2.16545
     1st Qu.:-0.66321   1st Qu.:-0.71451
     Median : 0.09367   Median :-0.03797
     Mean   : 0.02522   Mean   :-0.02153
     3rd Qu.: 0.65414   3rd Qu.: 0.55738
     Max.   : 2.18471   Max.   : 1.70199
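To experiment with this kind of output, a data frame of the same shape can be simulated. The following is a minimal sketch, not the data behind the summary above: the seed, the sample size n, and the way y is derived from x are all assumptions chosen for illustration, so the printed numbers will differ.

```r
# Hypothetical data: the seed, n, and the noise level are illustrative
# assumptions, not the values used to produce the summary shown in the text.
set.seed(42)
n <- 50
x <- rnorm(n)                          # standard normal draws
y <- x + rnorm(n, mean = 0, sd = 0.5)  # y depends linearly on x, plus noise
data <- data.frame(x = x, y = y)

# Per-column descriptive statistics: ranges and quartiles only.
print(summary(data))

# A single number that the per-column summary cannot supply:
# the Pearson correlation between the two columns.
r <- cor(data$x, data$y)
print(r)
```

Because summary() treats each column independently, two data frames with near-identical summaries can have very different correlations.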
A useful way to detect patterns and anomalies in the data is exploratory data
analysis with visualization. Visualization gives a succinct, holistic view of the
data that may be difficult to grasp from the numbers and summaries alone.
Variables x and y of the data frame data can instead be visualized in a scatterplot
(Figure 3.5), which readily shows the relationship between the two variables. An
important facet of initial data exploration, visualization helps assess data
cleanliness and suggests potentially important relationships in the data prior to the
model planning and building phases.
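A scatterplot of this kind can be drawn with the base R plot() function. The sketch below does not reproduce Figure 3.5: it uses hypothetical simulated data (the seed, sample size, and noise level are assumptions), and the title and axis labels are illustrative choices.

```r
# Hypothetical data standing in for the data frame discussed in the text.
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, mean = 0, sd = 0.5)
data <- data.frame(x = x, y = y)

# Base R scatterplot: one point per row, x on the horizontal axis.
# When run non-interactively (e.g. via Rscript), the plot is written
# to Rplots.pdf in the working directory instead of a window.
plot(data$x, data$y,
     main = "Scatterplot of x against y",
     xlab = "x", ylab = "y",
     pch = 19)   # solid circles
```

If the points fall roughly along a rising line, that suggests a positive linear relationship. Isolated points far from the main cloud are candidates for a data-quality check.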