Review of Basic Data Analytic Methods Using R - Data Science and Big Data Analytics

library(ggplot2)

# plot the jittered scatterplot w/ boxplot
# color-code points with zip codes
# outlier.shape=NA prevents the boxplot from plotting the outliers,
# which the jittered points already show
ggplot(data=DF, aes(x=as.factor(Zip1),
                    y=log10(MeanHouseholdIncome))) +
  geom_point(aes(color=factor(Zip1)), alpha=0.2,
             position="jitter") +
  geom_boxplot(outlier.shape=NA, alpha=0.1) +
  # guides(colour=FALSE) is deprecated in current ggplot2; "none" hides the legend
  guides(colour="none") +
  ggtitle("Mean Household Income by Zip Code")

Alternatively, one can create a simple box-and-whisker plot with the boxplot() function provided by base R (in the graphics package).
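As a rough sketch of the base-R alternative, the call below groups log10 income by the first zip digit using the formula interface of boxplot(). The data frame here is a synthetic stand-in invented for illustration (the column names Zip1 and MeanHouseholdIncome follow the ggplot2 example, but the values are made up), so the resulting plot will not match the book's figure.

```r
# Synthetic stand-in for the chapter's DF; values are invented for illustration
set.seed(42)
DF <- data.frame(
  Zip1 = sample(0:9, 500, replace = TRUE),
  MeanHouseholdIncome = rlnorm(500, meanlog = log(50000), sdlog = 0.4)
)

# Base-R box-and-whisker plot: one box per first zip digit
bp <- boxplot(log10(MeanHouseholdIncome) ~ as.factor(Zip1), data = DF,
              xlab = "Zip1", ylab = "log10(MeanHouseholdIncome)",
              main = "Mean Household Income by Zip Code")
```

boxplot() returns its summary statistics invisibly, so bp$stats holds the five-number summary for each group if the numbers are needed as well as the picture.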

Hexbinplot for Large Datasets

This chapter has shown that the scatterplot is a popular way to visualize data containing one or more variables. It should be used with care on high-volume data, however: with too many points, the structure of the data becomes difficult to see. Consider comparing the logarithm of household income against years of education, as shown in Figure 3.17. The cluster in the scatterplot on the left (a) suggests a somewhat linear relationship between the two variables, but overplotting hides how the data is distributed inside the cluster. This is a Big Data type of problem: millions or billions of data points require different approaches for exploration, visualization, and analysis.
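One such approach, in line with this section's title, is hexagonal binning: instead of drawing every point, the plane is divided into hexagonal cells and each cell is shaded by the number of points it contains. The sketch below assumes the hexbin package is installed and uses synthetic data invented for illustration. The variable names education and log_income, the sample size, and the slope of the simulated relationship are all assumptions, not values from the book's dataset.

```r
library(hexbin)  # assumption: the hexbin package is installed

# Synthetic data, invented for illustration only
set.seed(1)
n <- 100000
education  <- pmin(pmax(rnorm(n, mean = 13, sd = 3), 0), 22)
log_income <- 4 + 0.05 * education + rnorm(n, sd = 0.3)

# Bin the points into a hexagonal grid; each cell stores a count
hb <- hexbin(education, log_income, xbins = 40)

# Draw the cells, shaded by count, instead of 100,000 overlapping points
plot(hb, xlab = "Years of education", ylab = "log10(household income)",
     main = "Hexbin plot (synthetic data)")
```

Because each hexagon carries a count, dense regions inside the cluster remain distinguishable where a plain scatterplot would show only a solid blob.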
