Outliers and data validation
An outlier is an observation containing column values or a combination of
values lying outside the expected range of the population from which the dataset
observations were drawn. When outliers are included in datasets applied to data
mining algorithms, they may bias or invalidate the results of the analysis.
Hence, before application of the algorithms, the outliers should be identified
and either corrected or removed.
Outliers are usually generated in the data collection process. The measuring
instrument may have been faulty; there may have been errors in data entry; or
there may have been errors in communications or transmission.
Occasionally, the outlying entries may be valid, yet drawn from a different
population than the rest. These values may be explained by additional attributes
not captured in the dataset. For example, employers in a given region frequently
share their salary information in order to help each other understand how their
salaries compare to those of other employers in the region. In a study of
“custodian” salaries, most employers reported salaries in the range of $20,000
to $40,000 per year. One employer, however, reported a salary of $95,000. When
contacted to verify the salary, the employer responded, “Yes, that is correct. This is
my father-in-law, a part-owner of the company. He was given the title of
'custodian', because he has assumed the responsibility for taking out the trash
and vacuuming the floors.” In this case, the data was correct, but the definition of
“custodian” was not the same as that accepted by other submitting employers.
In the search for outliers, a number of checks may be applied:
Range checks - are any values outside the expected range for a given
attribute?
Computed checks - if there is redundancy in the dataset, is there
consistency in the redundancy? For example, do all probabilities sum
to one?
Feasibility and consistency checks - are all attribute combinations
possible? For example, is it possible to have a female patient diagnosed
with prostate cancer?
Pattern checks - when there are patterns observed between attributes in a
dataset, do all observations reasonably fit those patterns?
Temporal checks - in datasets containing multiple observations of the same
entity over time, are there large unexpected changes in a given attribute
value for a single entity?
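Several of the checks above can also be expressed programmatically. The following is a minimal Python sketch, not a method taken from the text: the function names, field names, sample records, and thresholds are all hypothetical and chosen only for illustration. It covers the range, computed, feasibility, and temporal checks; pattern checks are omitted because they depend on the specific relationship being modeled.

```python
from math import isclose

def range_check(rows, field, low, high):
    """Return rows whose value for `field` lies outside [low, high]."""
    return [r for r in rows if not (low <= r[field] <= high)]

def computed_check(rows, fields, expected_total=1.0, tol=1e-6):
    """Return rows where the listed fields do not sum to expected_total."""
    return [r for r in rows
            if not isclose(sum(r[f] for f in fields), expected_total, abs_tol=tol)]

def feasibility_check(rows, impossible):
    """Return rows that match any forbidden attribute combination."""
    return [r for r in rows
            if any(all(r.get(k) == v for k, v in combo.items())
                   for combo in impossible)]

def temporal_check(series, max_ratio):
    """Return indices where a positive value differs from the previous
    observation by more than a factor of max_ratio (up or down)."""
    flagged = []
    for i in range(1, len(series)):
        prev, cur = series[i - 1], series[i]
        if prev > 0 and cur > 0 and max(cur / prev, prev / cur) > max_ratio:
            flagged.append(i)
    return flagged

# Hypothetical data, loosely modeled on the examples in the text.
salaries = [
    {"employer": "A", "salary": 24000},
    {"employer": "B", "salary": 31000},
    {"employer": "C", "salary": 95000},   # the "custodian" outlier
]
probabilities = [
    {"p_low": 0.2, "p_mid": 0.5, "p_high": 0.3},
    {"p_low": 0.4, "p_mid": 0.5, "p_high": 0.3},   # sums to 1.2
]
patients = [
    {"sex": "F", "diagnosis": "asthma"},
    {"sex": "F", "diagnosis": "prostate cancer"},  # infeasible combination
]
readings = [100, 104, 98, 410, 102]  # one entity observed over time

salary_outliers = range_check(salaries, "salary", 20000, 40000)
bad_totals = computed_check(probabilities, ["p_low", "p_mid", "p_high"])
infeasible = feasibility_check(
    patients, [{"sex": "F", "diagnosis": "prostate cancer"}])
jumps = temporal_check(readings, max_ratio=2.0)

print(salary_outliers, bad_totals, infeasible, jumps, sep="\n")
```

Each function only flags candidate observations. As the custodian example shows, a flagged value may still be valid, so the decision to correct, remove, or keep it remains a separate step. Note that this temporal check flags both the jump to an unusual value and the return from it.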
The search for outliers can be time-consuming, but it is important.
In the examples that follow, visual approaches are presented using VisMiner.
These approaches support the search for outliers and make the process more efficient.
 