3.2 Measuring the Amount of Information
There have been many efforts to date to quantify the amount of information in a
communication stream. If we think of plain text, there are numerous quantifiable
features, including:
- The total number of words per minute
- The occurrence of specific words
- The frequency of occurrence for each word
- The occurrence of word pairs, triples, phrases, and sentences.
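The syntax-only features listed above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the text: the function name `text_features`, the `minutes` parameter, and the simple regular-expression tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

def text_features(text, minutes=1.0):
    """Compute simple, syntax-only features of a text sample.

    `minutes` is the (assumed) duration of the communication stream,
    used only to derive a words-per-minute rate.
    """
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)

    # Occurrence count of each specific word.
    word_counts = Counter(words)

    # Relative frequency of occurrence for each word.
    frequencies = {w: c / total for w, c in word_counts.items()} if total else {}

    # Occurrence count of adjacent word pairs.
    pair_counts = Counter(zip(words, words[1:]))

    return {
        "words_per_minute": total / minutes,
        "word_counts": word_counts,
        "frequencies": frequencies,
        "pair_counts": pair_counts,
    }

sample = "the cat sat on the mat and the cat slept"
features = text_features(sample, minutes=0.5)
print(features["words_per_minute"])
print(features["word_counts"].most_common(2))
print(features["pair_counts"][("the", "cat")])
```

Every quantity here depends only on the surface form of the text; nothing in it reflects what the receiver already knows or understands.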
There are problems, however, with such simplistic, syntax-only measurement. Words
can have variable significance; some are unnecessary or redundant. Many words can
encode the same concept. In fact, reading text or hearing speech may have no effect
on one's uncertainty regarding the subject of the text: you may already have
known it, or you may not understand the meaning of the words or their implied concepts.
This implies that the measurement of information content or volume can be specific to
the individual receiver and, as we'll see later, to the task that is being performed based
on the communication.
Can we perform similar analysis on a dataset? Consider a table of numeric values.
Features of potential interest in the dataset include:
- The number of entries or dimensions
- The values
- Clusters and their attributes (number, size, relations, …)
- Trends and their attributes (size, rate of change, …)
- Outliers and their attributes (number, degree of outlierness, relation to dense
  regions, …)
- Associations, correlations and any features between records, dimensions, or
  individual values.
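A few of these dataset features can be extracted with very little code. The sketch below is only an illustration under stated assumptions: the helper names `column_outliers` and `pearson`, the z-score threshold of 2.0, and the small sample table are all hypothetical, and z-scores are just one of many possible measures of "outlierness".

```python
from statistics import mean, pstdev

def column_outliers(values, threshold=2.0):
    """Return (index, z_score) for values far from the column mean.

    The z-score threshold is an assumed, task-dependent choice.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # A constant column has no outliers by this measure.
    return [(i, (v - mu) / sigma)
            for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # Correlation is undefined for a constant column.
    return cov / (sx * sy)

# Hypothetical table: each row is a record, each column a dimension.
table = [
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, 8.1],
    [5.0, 9.8],
    [6.0, 40.0],
]

n_entries = len(table)
n_dimensions = len(table[0])
col0 = [row[0] for row in table]
col1 = [row[1] for row in table]

outliers = column_outliers(col1)
correlation = pearson(col0, col1)
print(n_entries, n_dimensions)
print(outliers)
print(round(correlation, 3))
```

Clusters and trends would need further machinery (for example, a clustering algorithm or a regression fit), which is deliberately left out of this sketch.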
In fact, we can observe that a featureless dataset is not differentiable from random
noise: all values are equally likely. Features and relations can also vary in their
magnitude, certainty, complexity, and importance. Clusters may differ in size; outliers
may vary in their distance to the main body of data; features may be composed of
many sub-features; in many cases, a feature that is significant to one observer may be
considered noise by another. Recently, researchers have proposed measuring and
counting insights [9], which are new knowledge gained during visual analysis. These
insights are generally specific to a particular task, some of which include [10]:
- Identify data characteristics
- Locate boundaries, critical points, other features
- Distinguish regions of different characteristics
- Categorize or classify
- Rank based on some order
- Compare to find similarities and differences
- Associate into relations
- Correlate by classifying relations.
For each of these tasks, we might have different accuracy requirements as well, which
can influence the resolution at which feature extraction is accomplished during com-