could be custodial accounts or college savings accounts set up by the parents of
young children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero
years old or have negative ages. This is likely to be evidence of missing data.
One possible explanation is that null age values were replaced by 0 or negative
values during data input. This can happen when age is entered in a text box that
only allows numbers and does not accept empty values. It can also happen when
data is transferred among several systems that have different definitions for
null values (such as NULL, NA, 0, -1, or -2). Therefore,
data cleansing needs to be performed over the accounts with abnormal age
values. Analysts should take a closer look at the records to decide if the missing
data should be eliminated or if an appropriate age value can be determined using
other available information for each of the accounts.
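As a minimal sketch of this cleansing step, the following R code recodes suspected placeholder ages as NA so that R treats them as missing. The accounts data frame, its column names, and the choice of 0, -1, and -2 as sentinel codes are assumptions made for illustration; in real data, an age of 0 may be legitimate and should be checked against other account information first.

```r
# Hypothetical account data; column names and values are illustrative only.
accounts <- data.frame(
  id  = 1:6,
  age = c(34, 0, -1, 52, -2, 7)
)

# Assumption: 0, -1, and -2 are sentinel codes meaning "age unknown".
sentinels <- c(0, -1, -2)

# Recode the sentinel values as NA so that R treats them as missing.
accounts$age[accounts$age %in% sentinels] <- NA

# Count how many ages are now flagged as missing.
n_missing <- sum(is.na(accounts$age))
```

Recoding to NA, rather than deleting the rows, keeps the accounts available in case an age can later be determined from other fields.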
In R, the is.na() function provides tests for missing values. The following
example creates a vector x where the fourth value is not available (NA). The
is.na() function returns TRUE at each NA value and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE
Some arithmetic functions, such as mean(), applied to data containing missing
values can yield an NA result. To prevent this, set the na.rm parameter to TRUE to
remove the missing value during the function's execution.
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5
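The same idea extends to other base R summaries. The sketch below counts and locates the missing values in x, then applies sum() and max() with na.rm = TRUE. The variable names are illustrative. Note that na.rm only affects the calculation; it does not modify x.

```r
x <- c(1, 2, 3, NA, 4)

# Count and locate the missing values before summarizing.
n_na   <- sum(is.na(x))     # number of NA entries
na_pos <- which(is.na(x))   # positions of the NA entries

# Other base R summaries accept na.rm in the same way as mean().
total   <- sum(x, na.rm = TRUE)
largest <- max(x, na.rm = TRUE)

# Equivalent to mean(x, na.rm = TRUE): drop the NAs explicitly first.
mean_manual <- mean(x[!is.na(x)])
```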
The na.exclude() function returns the object with incomplete cases removed.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA
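As a short sketch of how na.exclude() might be applied to this data frame, the code below removes the row containing NA and, for comparison, selects the same rows with complete.cases(). The names DF1, DF2, and keep are chosen here for illustration.

```r
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))

# Drop every row that contains at least one NA.
DF1 <- na.exclude(DF)

# complete.cases() flags the rows that have no missing values,
# which gives the same rows when used as a row index.
keep <- complete.cases(DF)
DF2  <- DF[keep, ]
```

Both DF1 and DF2 keep only the first two rows of DF, because the third row has a missing y value.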