Descriptive Statistics
The mean, median, and standard deviation are calculated with mean(), median(),
and sd(), respectively. To generate an overview of each column of a data frame,
use summary():
> summary(ct)
      PID          GEO_ID                      GEO_ID2        SUMLEV
 Min.   :  1   Min.   :4.21e+10   14000US42101000100:  1   Min.   :140
 1st Qu.: 96   1st Qu.:4.21e+10   14000US42101000200:  1   1st Qu.:140
 Median :191   Median :4.21e+10   14000US42101000300:  1   Median :140
 Mean   :191   Mean   :4.21e+10   14000US42101000400:  1   Mean   :140
 3rd Qu.:286   3rd Qu.:4.21e+10   14000US42101000500:  1   3rd Qu.:140
 Max.   :381   Max.   :4.21e+10   14000US42101000600:  1   Max.   :140
(snip)
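For a single numeric column, the three functions named above can be tried out directly. The following is a minimal sketch on a hypothetical toy data frame (df, pop, and size are illustrative names, not columns of the census-tract data used in the text); it also shows how a missing value propagates unless na.rm=TRUE is given.

```r
# Hypothetical toy data; NOT the ct data frame from the text.
df <- data.frame(
  pop  = c(10, 20, 30, 40, NA),      # contains one missing value
  size = c(2.5, 3.0, 3.5, 4.0, 4.5)  # complete column
)

m_size   <- mean(df$size)     # arithmetic mean
med_size <- median(df$size)   # middle value
sd_size  <- sd(df$size)       # sample standard deviation (n - 1 denominator)

# A single NA makes the result NA unless it is removed first.
m_pop_raw <- mean(df$pop)                 # NA
m_pop     <- mean(df$pop, na.rm = TRUE)   # mean of the four available values

print(c(mean = m_size, median = med_size, sd = sd_size))
print(c(raw = m_pop_raw, na_removed = m_pop))
```

summary(df) on the same toy data frame prints the quartiles, the mean, and a count of NA values for each column.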
> sd(ct, na.rm=TRUE)
# na.rm=TRUE is needed if the data contain missing values.
# The standard deviation is then computed from the
# available (non-missing) values only.
         PID       GEO_ID      GEO_ID2       SUMLEV
1.101295e+02 1.075777e+04           NA 0.000000e+00
     GEONAME      GEOCOMP        STATE       COUNTY
          NA 0.000000e+00 0.000000e+00 0.000000e+00
       TRACT     STATEP00    COUNTYP00    TRACTCE00
1.075777e+04 1.101295e+02 0.000000e+00 1.075777e+04
(snip)
Not every column returns a numeric value. For example, MTFCC00 returns NA
because its type is factor, as opposed to num or int (see the output from
str() above). The na.rm=TRUE argument in sd() removes missing data before the
calculation. The call also produces a warning:
Warning messages:
1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
The warning alerts the user that a column is not of num or int type. Of course,
the standard deviations of MTFCC00 or FUNCSTAT00 are nonsensical, and therefore
not worth calculating. In this case, we can ignore the warning message.
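One way to avoid the warning altogether is to compute the standard deviation only for the numeric columns. The sketch below assumes a hypothetical toy data frame (df, id, value, and code are illustrative names, not the census-tract data); it may also be useful because newer releases of R no longer accept a whole data frame as the argument to sd().

```r
# Hypothetical toy data with mixed column types; NOT the ct data frame.
df <- data.frame(
  id    = 1:5,                                  # int
  value = c(1.5, 2.5, NA, 4.5, 5.5),            # num, one missing value
  code  = factor(c("a", "b", "a", "c", "b"))    # factor, no meaningful sd
)

# Logical vector marking the num/int columns; factors give FALSE.
is_num <- sapply(df, is.numeric)

# Apply sd() column by column to the numeric columns only.
# drop = FALSE keeps a data frame even if just one column is selected.
sds <- sapply(df[, is_num, drop = FALSE], sd, na.rm = TRUE)

print(is_num)
print(sds)
```

Selecting columns by type rather than by position keeps the call valid if columns are later added or reordered.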
Let's look at two more descriptive statistics, correlation and frequency:
> cor(ct[,c(18,19)], method="pearson", use="complete")
totalPop totalHousehold
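Because the output above is cut off, a small self-contained sketch may help show what these two statistics look like. The data below are hypothetical (only the column names totalPop and totalHousehold are taken from the text; kind is an invented column), and table() is used here as one common way to obtain frequencies, which is an assumption rather than the text's own choice. The value use="complete" in the call above is a partial match for "complete.obs", which is spelled out in full here.

```r
# Hypothetical toy data; values are invented for illustration.
df <- data.frame(
  totalPop       = c(100, 200, 300, 400, NA),   # one missing value
  totalHousehold = c(40, 80, 120, 160, 200),
  kind           = c("a", "b", "a", "a", "b")   # invented categorical column
)

# Pearson correlation matrix; rows containing an NA are dropped first.
r <- cor(df[, c("totalPop", "totalHousehold")],
         method = "pearson", use = "complete.obs")

# Frequency of each distinct value in a categorical column.
freq <- table(df$kind)

print(r)
print(freq)
```

cor() returns a symmetric matrix with ones on the diagonal, so for two columns the single off-diagonal entry is the correlation of interest.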