could be custodial accounts or college savings accounts set up by the parents of
young children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero
years old or have negative ages. This is likely to be evidence of missing data.
One possible explanation is that null age values were replaced by 0 or negative
values during data entry. This can happen when age is entered in a text box that
only allows numbers and does not accept empty values. It can also happen when
data is transferred among several systems that have different definitions for
null values (such as NULL, NA, 0, -1, or -2). Therefore, data cleansing needs to
be performed on the accounts with abnormal age values. Analysts should take a
closer look at the records to decide whether the missing data should be
eliminated or whether an appropriate age value can be determined using other
available information for each of the accounts.
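The recoding step described above can be sketched in R. This is only a minimal illustration, not the procedure applied to the data in the graph: the accounts data frame, its column names, and the rule that every age of zero or below is a placeholder for a missing value are assumptions made for the example.

```r
# Hypothetical account data; the ages 0, -1, and -2 stand in for
# missing values (column names and values are illustrative only)
accounts <- data.frame(
  id  = 1:6,
  age = c(34, 0, -1, 52, 7, -2)
)

# Flag records whose recorded age is zero or negative
abnormal <- !is.na(accounts$age) & accounts$age <= 0

# Recode the placeholder values as NA so that R treats them as missing;
# the original column is kept so the analyst can still review it
accounts$age_clean <- ifelse(abnormal, NA_real_, accounts$age)

# Number of flagged records that need a closer look
sum(abnormal)
```

Keeping the cleaned values in a separate column leaves the raw input untouched, which may help when deciding later whether an age can be recovered from other account information. Note that the young account holder (age 7) is retained.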
In R, the is.na() function provides tests for missing values. The following
example creates a vector x where the fourth value is not available (NA). The
is.na() function returns TRUE at each NA value and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
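Because R treats TRUE as 1 and FALSE as 0 in arithmetic, the result of is.na() can also be used to count or locate missing values. A brief sketch using the same vector x:

```r
x <- c(1, 2, 3, NA, 4)

# Number of missing values
sum(is.na(x))      # [1] 1

# Positions of the missing values
which(is.na(x))    # [1] 4

# Keep only the observed values
x[!is.na(x)]       # [1] 1 2 3 4
```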
Some arithmetic functions, such as mean(), applied to data containing missing
values can yield an NA result. To prevent this, set the na.rm parameter to TRUE
to remove the missing value during the function's execution.
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5
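Other base R summary functions, such as sum(), max(), and median(), accept the na.rm parameter as well. The short sketch below reuses the vector x and also reports how many values were actually used, since a result computed with na.rm=TRUE rests on fewer observations than the length of the vector suggests.

```r
x <- c(1, 2, 3, NA, 4)

sum(x, na.rm=TRUE)      # [1] 10
max(x, na.rm=TRUE)      # [1] 4
median(x, na.rm=TRUE)   # [1] 2.5

# Number of non-missing values used in the calculations above
sum(!is.na(x))          # [1] 4
```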
The na.exclude() function returns the object with incomplete cases removed.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA
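As a sketch of how the incomplete row might be identified and dropped, the example below applies complete.cases() and na.exclude() to the same DF data frame. Both are base R functions; the name DF_clean is introduced here only for illustration.

```r
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

# TRUE for each row that contains no missing values
complete.cases(DF)          # [1]  TRUE  TRUE FALSE

# Drop every row that has at least one NA
DF_clean <- na.exclude(DF)
nrow(DF_clean)              # [1] 2
```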