Descriptive Statistics
The mean, median, and standard deviation are calculated with mean(), median(), and sd(), respectively. To generate an overview of each column of a data frame, use summary():
> summary(ct)
PID GEO_ID GEO_ID2 SUMLEV
Min. : 1 Min. :4.21e+10 14000US42101000100: 1 Min. :140
1st Qu.: 96 1st Qu.:4.21e+10 14000US42101000200: 1 1st Qu.:140
Median :191 Median :4.21e+10 14000US42101000300: 1 Median :140
Mean :191 Mean :4.21e+10 14000US42101000400: 1 Mean :140
3rd Qu.:286 3rd Qu.:4.21e+10 14000US42101000500: 1 3rd Qu.:140
Max. :381 Max. :4.21e+10 14000US42101000600: 1 Max. :140
(snip)
> sd(ct, na.rm=TRUE)
# na.rm=TRUE is necessary if there are missing data
# in the standard deviation calculation.
# The sample size then counts only the available (non-missing) data.
PID GEO_ID GEO_ID2 SUMLEV
1.101295e+02 1.075777e+04 NA 0.000000e+00
GEONAME GEOCOMP STATE COUNTY
NA 0.000000e+00 0.000000e+00 0.000000e+00
TRACT STATEP00 COUNTYP00 TRACTCE00
1.075777e+04 1.101295e+02 0.000000e+00 1.075777e+04
(snip)
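To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector. The vector x and its values are assumptions for illustration only and are not taken from ct.

```r
# Hypothetical toy vector with one missing value (not from the ct data).
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

# Without na.rm, a single NA makes the result NA.
mean_with_na <- mean(x)
sd_with_na   <- sd(x)

# With na.rm = TRUE, the NA is dropped before the calculation,
# so only the available values are used.
n_used   <- sum(!is.na(x))            # number of non-missing values
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
sd_x     <- sd(x, na.rm = TRUE)       # sample sd, divides by n_used - 1

print(c(n = n_used, mean = mean_x, median = median_x, sd = sd_x))
```

Note that sd() computes the sample standard deviation, so the divisor is the number of available values minus one.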
Not all of the columns return a numeric value. For example, MTFCC00 returns NA: its type is factor, as opposed to num or int (see the output from str() above). The na.rm=TRUE argument in the sd function removes missing data before the calculation. The call is also followed by a warning:
Warning messages:
1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
The warning alerts the user that the column is not of num or int type. Of course, the standard deviations of MTFCC00 or FUNCSTAT00 are meaningless, and therefore not worth calculating. In this case, we can ignore the warning message.
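One way to avoid the warning altogether is to restrict the calculation to numeric columns. The sketch below assumes a small stand-in data frame, df, with made-up values; the column names mirror those in ct, but the data are hypothetical. Computing column by column with sapply() is also more portable, because R 3.0.0 and later no longer accept a whole data frame in sd().

```r
# Hypothetical stand-in for ct: two numeric columns and one factor column.
# All values are made up for illustration.
df <- data.frame(
  totalPop       = c(100, 250, NA, 400),
  totalHousehold = c(40, 90, 120, 160),
  MTFCC00        = factor(c("a", "a", "b", "b"))
)

# Flag the columns that are numeric (num or int).
is_num <- sapply(df, is.numeric)

# Compute the standard deviation column by column,
# skipping missing values; the factor column is left out.
sds <- sapply(df[, is_num, drop = FALSE], sd, na.rm = TRUE)
print(sds)
```

With the real data, the same pattern would be applied to ct in place of df.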
Let's look at two more descriptive statistics, correlation and frequency:
> cor(ct[,c(18,19)], method="pearson", use="complete")
totalPop totalHousehold