Understanding Data - Data Points: Visualization That Means Something

You have to know the who, what, when, where, why, and how—the metadata,

or the data about the data—before you can know what the numbers are

actually about.

Who: A quote in a major newspaper carries more weight than one from a

celebrity gossip site that has a reputation for stretching the truth. Similarly,

data from a reputable source typically implies better accuracy than a random

online poll.

For example, Gallup, which has measured public opinion since the 1930s, is

more reliable than, say, someone (for example, me) experimenting with a small,

one-off Twitter sample late at night during a short period of time. Whereas

the former works to create samples representative of a region, there are

unknowns with the latter.

Speaking of which, in addition to who collected the data, who the data is

about is also important. Going back to the gumballs, it's often not financially

feasible to collect data about everyone or everything in a population. Most

people don't have time to count and categorize a thousand gumballs, much

less a million, so they sample. The key is to sample evenly across the population

so that it is representative of the whole. Did the data collectors do that?
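
To make the point about even sampling concrete, here is a minimal Python sketch. The gumball machine, its color mix, and the helper name color_shares are all assumptions made up for illustration; none of them come from the text. It compares a sample where every gumball has the same chance of being picked against a convenience sample taken only from the top of the machine.

```python
import random
from collections import Counter

def color_shares(gumballs):
    """Return each color's share of the given gumballs as a fraction."""
    counts = Counter(gumballs)
    total = len(gumballs)
    return {color: n / total for color, n in counts.items()}

# Hypothetical machine: 1,000 gumballs loaded in layers by color,
# so a gumball's position is tied to its color.
population = ["red"] * 500 + ["blue"] * 300 + ["green"] * 200

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Even sampling: every gumball is equally likely to be chosen.
random_sample = rng.sample(population, 100)

# Convenience sampling: just take the first 100 from the top layer.
top_sample = population[:100]

print("population:   ", color_shares(population))
print("random sample:", color_shares(random_sample))
print("top sample:   ", color_shares(top_sample))
```

The top-layer sample contains only red gumballs, so it says nothing about blue or green, whereas the random sample should land reasonably close to the true mix. Real surveys face the same issue when only the easiest-to-reach subjects are counted.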

How: People often skip methodology because it tends to be complex and written for

a technical audience, but it's worth getting to know the gist of how the data

of interest was collected.

If you're the one who collected the data, then you're good to go, but when you

grab a dataset online, provided by someone you've never met, how will you

know if it's any good? Do you trust it right away, or do you investigate? You don't

have to know the exact statistical model behind every dataset, but look out for

small samples, high margins of error, and unfit assumptions about the subjects,

such as indices or rankings that incorporate spotty or unrelated information.
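
As a rough illustration of why small samples deserve suspicion, the sketch below uses the standard textbook approximation for the margin of error of a sample proportion, z * sqrt(p * (1 - p) / n), at roughly 95 percent confidence (z = 1.96). It assumes a simple random sample, which a one-off online poll usually is not, and the function name margin_of_error and the sample sizes are chosen here for illustration only.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    p: observed proportion (0 to 1)
    n: sample size
    z: z-score for the confidence level (1.96 is about 95 percent)
    Assumes a simple random sample; real polls may need adjustments.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Same observed result (50 percent), two very different sample sizes.
for n in (50, 1000):
    moe = margin_of_error(0.5, n)
    print(f"n={n}: 50% plus or minus {moe * 100:.1f} points")
```

With 50 respondents the estimate can be off by about 14 percentage points either way; with 1,000 it narrows to about 3. The formula only captures sampling noise, so a biased sample can still be far off even when the computed margin looks small.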

Sometimes people generate indices to measure the quality of life in countries,

and a metric like literacy is used as a factor. However, a country might not have

up-to-date information on literacy, so the data gatherer simply uses an estimate

from a decade earlier. That's going to cause problems because then the index

works only under the assumption that the literacy rate one decade earlier is

comparable to the present, which might not be (and probably isn't) the case.
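
One simple way to catch this kind of problem is to record the year each input was measured and flag anything too old before computing an index. The sketch below is one possible approach, not a method from the text: the Indicator class, the stale_indicators helper, the five-year cutoff, and all of the values are placeholders invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    value: float
    year: int  # year the value was measured or estimated

def stale_indicators(indicators, index_year, max_age=5):
    """Return names of indicators measured more than max_age years before index_year."""
    return [i.name for i in indicators if index_year - i.year > max_age]

# Placeholder inputs for a hypothetical quality-of-life index.
# The numbers are made up and do not describe any real country.
inputs = [
    Indicator("literacy_rate", 0.80, 2003),
    Indicator("life_expectancy", 70.0, 2012),
]

print(stale_indicators(inputs, index_year=2013))
```

Here the literacy figure is a decade old relative to the index year, so it gets flagged. What to do next (drop the factor, find a newer source, or state the assumption openly) is a judgment call that depends on the dataset.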

What: Ultimately, you want to know what your data is about, but before you

can do that, you should know what surrounds the numbers. Talk to subject

experts, read papers, and study accompanying documentation.
