You have to know the who, what, when, where, why, and how—the metadata,
or the data about the data—before you can know what the numbers are
actually about.
Who: A quote in a major newspaper carries more weight than one from a
celebrity gossip site that has a reputation for stretching the truth. Similarly,
data from a reputable source typically implies better accuracy than a random
online poll.
For example, Gallup, which has measured public opinion since the 1930s, is
more reliable than, say, someone (me, for instance) experimenting with a small,
one-off Twitter sample late at night during a short period of time. Whereas
the former works to create samples representative of a region, the latter
comes with unknowns.
Speaking of which, in addition to who collected the data, who the data is
about is also important. Going back to the gumballs, it's often not financially
feasible to collect data about everyone or everything in a population. Most
people don't have time to count and categorize a thousand gumballs, much
less a million, so they sample. The key is to sample evenly across the
population so that it is representative of the whole. Did the data collectors do that?
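To make the sampling idea concrete, here is a minimal sketch in Python. It assumes a made-up gumball machine with a million gumballs in three colors; the color mix, the sample size, and the function name sample_proportions are all hypothetical, chosen only for illustration. Every gumball has the same chance of being drawn, which is one simple way to sample evenly.

```python
import random
from collections import Counter


def sample_proportions(population, sample_size, seed=0):
    """Estimate category shares from a simple random sample.

    Each item has the same chance of being picked, and no item
    is picked twice (sampling without replacement).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)
    counts = Counter(sample)
    return {category: n / sample_size for category, n in counts.items()}


# Hypothetical machine: 50% red, 30% blue, 20% green.
population = ["red"] * 500_000 + ["blue"] * 300_000 + ["green"] * 200_000

# Count only 1,000 gumballs instead of all 1,000,000.
estimate = sample_proportions(population, 1_000)
print(estimate)
```

The estimated shares should land close to the true mix, though not exactly on it. If the sample were drawn only from the top layer of the machine instead, there would be no such assurance, which is why it matters how the collectors picked their sample.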
How: People often skip methodology because it tends to be complex and written
for a technical audience, but it's worth getting to know the gist of how the data
of interest was collected.
If you're the one who collected the data, then you're good to go, but when you
grab a dataset online, provided by someone you've never met, how will you
know if it's any good? Do you trust it right away, or do you investigate? You don't
have to know the exact statistical model behind every dataset, but look out for
small samples, high margins of error, and unfit assumptions about the subjects,
such as indices or rankings that incorporate spotty or unrelated information.
Sometimes people generate indices to measure the quality of life in countries,
and a metric like literacy is used as a factor. However, a country might not have
up-to-date information on literacy, so the data gatherer simply uses an estimate
from a decade earlier. That's going to cause problems because then the index
works only under the assumption that the literacy rate one decade earlier is
comparable to the present, which might not be (and probably isn't) the case.
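The warning about small samples and high margins of error can be illustrated with a rough calculation. The sketch below uses the common normal approximation for a proportion from a simple random sample at roughly 95 percent confidence (z = 1.96). The function name margin_of_error and the sample sizes are assumptions for illustration, and real polls often use more involved methods.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p estimated
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


# Compare a small one-off sample with a larger one, both reporting 50%.
for n in (50, 1000):
    print(n, round(margin_of_error(0.5, n), 3))
# 50 0.139
# 1000 0.031
```

With 50 respondents, a reported 50 percent could plausibly be anywhere from about 36 to 64 percent. With 1,000 respondents the range narrows to roughly 47 to 53 percent. This only accounts for random sampling error; it says nothing about a sample that was unrepresentative to begin with.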
What: Ultimately, you want to know what your data is about, but before you
can do that, you should know what surrounds the numbers. Talk to subject
experts, read papers, and study accompanying documentation.