It's possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use the sd() function, and to get a count, use the length() function:
ddply(cabbages, c("Cult", "Date"), summarise,
      Weight = mean(HeadWt),
      sd = sd(HeadWt),
      n = length(HeadWt))
 Cult Date Weight        sd  n
  c39  d16   3.18 0.9566144 10
  c39  d20   2.80 0.2788867 10
  c39  d21   2.74 0.9834181 10
  c52  d16   2.26 0.4452215 10
  c52  d20   3.11 0.7908505 10
  c52  d21   1.47 0.2110819 10
Other useful functions for generating summary statistics include min(), max(), and median().
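As a minimal sketch of how these functions slot into the same kind of call, the example below swaps them in for mean() and sd(). It assumes the cabbages data set comes from the MASS package and that ddply() and summarise come from plyr; the column names Min, Median, and Max and the variable name stats are illustrative choices, not names from the text.

```r
library(MASS)  # assumed source of the cabbages data set
library(plyr)  # assumed source of ddply() and summarise()

# One row per Cult/Date group, with three summary columns.
# Min, Median, and Max are illustrative column names.
stats <- ddply(cabbages, c("Cult", "Date"), summarise,
               Min    = min(HeadWt),
               Median = median(HeadWt),
               Max    = max(HeadWt))
stats
```

Capitalizing the column names keeps them visually distinct from the functions that compute them, although lowercase names such as sd in the call above also work.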
Dealing with NAs
One potential pitfall is that NAs in the data will lead to NAs in the output. Let's see what happens if we sprinkle a few NAs into HeadWt:
c1 <- cabbages                      # Make a copy
c1$HeadWt[c(1, 20, 45)] <- NA       # Set some values to NA
ddply(c1, c("Cult", "Date"), summarise,
      Weight = mean(HeadWt),
      sd = sd(HeadWt),
      n = length(HeadWt))
 Cult Date Weight        sd  n
  c39  d16     NA        NA 10
  c39  d20     NA        NA 10
  c39  d21   2.74 0.9834181 10
  c52  d16   2.26 0.4452215 10
  c52  d20     NA        NA 10
  c52  d21   1.47 0.2110819 10
There are two problems here. The first problem is that mean() and sd() simply return NA if any of the input values are NA. Fortunately, these functions have an option to deal with this very issue: setting na.rm=TRUE will tell them to ignore the NAs.
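The effect of na.rm is easiest to see on a small vector. The sketch below uses a made-up vector x, not values from the cabbages data, and only base R functions; the variable names are illustrative.

```r
# Hypothetical weights with one missing value (not taken from cabbages)
x <- c(2.5, NA, 3.5)

m_default <- mean(x)                # NA, because one input is NA
m_clean   <- mean(x, na.rm = TRUE)  # 3, the mean of 2.5 and 3.5
s_clean   <- sd(x, na.rm = TRUE)    # standard deviation of 2.5 and 3.5 only

m_default
m_clean
s_clean
```

The same na.rm=TRUE argument can be passed to mean() and sd() inside a ddply() call with summarise.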
The second problem is that length() counts NAs just like any other value, but since these values represent missing data, they should be excluded from the count. The length() function doesn't