3.7 Exercises
3.1 Data quality can be assessed in terms of several issues, including accuracy, completeness, and consistency. For each of the above three issues, discuss how data quality assessment can depend on the intended use of the data, giving examples. Propose two other dimensions of data quality.
3.2 In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
3.3 Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
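As a starting point for part (a), the following is a minimal Python sketch of smoothing by bin means, assuming equal-depth (equal-frequency) bins formed from the already sorted data. The function name `smooth_by_bin_means` is hypothetical and chosen only for illustration; it is not taken from the text.

```python
# Age data from Exercise 3.3, already in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(sorted_values, depth):
    """Replace each value with the mean of its equal-depth bin.

    Hypothetical helper: assumes sorted_values is sorted and that
    consecutive runs of `depth` values form the bins.
    """
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        bin_mean = sum(bin_values) / len(bin_values)
        # Every value in the bin is replaced by the bin mean.
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


smoothed_ages = smooth_by_bin_means(ages, depth=3)

# Show each bin next to its smoothed value to illustrate the steps.
for start in range(0, len(ages), 3):
    print(ages[start:start + 3], "->", round(smoothed_ages[start], 2))
```

With 27 values and a depth of 3, the sketch produces nine bins. Comparing each original bin with its mean is one way to judge how much detail the smoothing removes.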
3.4 Discuss issues to consider during data integration.
3.5 What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
3.6 Use these methods to normalize the following group of data: 200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
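The four methods can be sketched in Python as follows. This is a minimal illustration rather than a reference solution. It assumes the population standard deviation (dividing by n) for the z-score; a sample standard deviation (dividing by n - 1) would give different values. All function names are hypothetical.

```python
import math

data = [200, 300, 400, 600, 1000]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation
    (an assumption; the exercise does not say which estimator to use)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def z_score_mad(values):
    """Center on the mean and scale by the mean absolute deviation."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that every
    scaled absolute value is strictly less than 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


for name, fn in [("min-max", min_max), ("z-score", z_score),
                 ("z-score (MAD)", z_score_mad),
                 ("decimal scaling", decimal_scaling)]:
    print(name, [round(x, 4) for x in fn(data)])
```

Note that decimal scaling here requires the scaled maximum to be strictly below 1, so the value 1000 forces a division by 10,000 rather than 1,000.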
3.7 Using the data for age given in Exercise 3.3, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons why.
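For checking hand calculations in parts (a) to (c), the arithmetic can be sketched directly in Python. The sketch assumes the standard deviation of 12.94 stated in part (b) rather than recomputing it, and the variable names are illustrative only.

```python
# Age data from Exercise 3.3.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
value = 35
stated_std = 12.94  # standard deviation given in part (b)

# (a) min-max normalization onto [0.0, 1.0]
min_max_35 = (value - min(ages)) / (max(ages) - min(ages))

# (b) z-score normalization using the stated standard deviation
mean_age = sum(ages) / len(ages)
z_35 = (value - mean_age) / stated_std

# (c) decimal scaling: smallest j with max(|age|) / 10**j < 1
j = 0
while max(abs(a) for a in ages) / 10 ** j >= 1:
    j += 1
decimal_35 = value / 10 ** j

print(round(min_max_35, 4), round(z_35, 4), decimal_35)
```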