Examination: Once you have the data, a thorough examination will determine how
confident you can be in the suitability of what you have acquired. This involves
assessing the completeness and fitness of the data to serve your needs. Many tools
can help you work through this stage efficiently. Depending on the size and
complexity of your data, and on your own capabilities, software such as Excel,
Tableau, or Google Refine (among plenty of others) will enable you to quickly scan,
filter, sort, and search through your dataset to establish its state of quality.
As you go through this process, you should examine the following potential issues:
Completeness: Is it all there or do you need more? Are the size and shape
consistent with your expectations? Does it have all the categories you were
expecting? Does it cover the time period you wanted? Are all the fields or
variables included? Does it contain the expected number of records?
Quality: Are there noticeable errors? Are there any unexplained classifications
or codings? Are there any formatting issues, such as unusual dates or stray ASCII
characters? Are there any incomplete or missing items? Any duplicates? Does the
accuracy of the data appear fine? Are there any unusual values or obvious outliers?
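Several of the completeness and quality questions above can be checked programmatically. The following is a minimal sketch in Python using only the standard library. The sample records, the field names (year, medal, athlete), and the examine function are hypothetical and exist only for illustration; a real dataset would more likely be loaded from a file.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"year": "1896", "medal": "Gold", "athlete": "A. Example"},
    {"year": "1896", "medal": "Silver", "athlete": "B. Example"},
    {"year": "1900", "medal": "", "athlete": "C. Example"},        # missing medal
    {"year": "1896", "medal": "Gold", "athlete": "A. Example"},    # duplicate row
    {"year": "19O4", "medal": "Bronze", "athlete": "D. Example"},  # letter O, not zero
]

# Assumed expectations about the dataset (set these from your own brief).
EXPECTED_FIELDS = {"year", "medal", "athlete"}
EXPECTED_MEDALS = {"Gold", "Silver", "Bronze"}


def examine(rows):
    """Return a small report covering completeness and quality checks."""
    report = {"record_count": len(rows)}

    # Completeness: does every record carry all the expected fields?
    report["rows_missing_fields"] = sum(
        1 for row in rows if not EXPECTED_FIELDS.issubset(row)
    )

    # Quality: count empty values per field.
    report["empty_values"] = dict(Counter(
        field
        for row in rows
        for field in sorted(EXPECTED_FIELDS)
        if not str(row.get(field, "")).strip()
    ))

    # Quality: count rows that exactly repeat an earlier row.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    report["duplicate_rows"] = sum(n - 1 for n in row_counts.values() if n > 1)

    # Quality: non-empty category values outside the expected set.
    report["unexpected_medals"] = sorted({
        row["medal"] for row in rows
        if row.get("medal") and row["medal"] not in EXPECTED_MEDALS
    })

    # Quality: years that are not purely numeric (a formatting issue).
    report["invalid_years"] = [
        row.get("year", "") for row in rows
        if not row.get("year", "").isdigit()
    ]
    return report


report = examine(records)
for check, result in report.items():
    print(f"{check}: {result}")
```

A report like this does not replace scanning the data by eye, but it gives a repeatable first pass that can be rerun whenever the dataset is refreshed.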
Data types: Understanding the properties of our raw material is an important
task. We will do some visual exploring later to learn about the physical patterns
and relationships but, for now, we need to understand the fundamental structure of
our data in terms of its variable types. This will become important when we move
into the design discussion in Chapter 4, Preparing and Familiarizing With Data. The
following table outlines the discrete types of data with associated examples:
Types | Examples
Categorical nominal | Countries, gender, text
Categorical ordinal | Olympic medals, "Likert" scale
Quantitative (interval-scale) | Dates, temperature
Quantitative (ratio-scale) | Prices, age, distance
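The practical difference between these types is which operations on the values are meaningful. The short Python sketch below illustrates this with made-up values; the MEDAL_RANK mapping and all sample data are assumptions introduced for the example.

```python
from collections import Counter

# Categorical nominal: no inherent order, so counting and equality tests
# are the meaningful operations.
countries = ["Kenya", "USA", "Kenya", "Japan"]
country_counts = Counter(countries)

# Categorical ordinal: values have a rank, so sort by that rank rather than
# alphabetically. MEDAL_RANK is a hypothetical mapping for illustration.
MEDAL_RANK = {"Gold": 1, "Silver": 2, "Bronze": 3}
medals = ["Bronze", "Gold", "Silver", "Gold"]
medals_by_rank = sorted(medals, key=MEDAL_RANK.__getitem__)
medals_alphabetical = sorted(medals)  # puts Bronze first, which misleads

# Quantitative (interval-scale): differences are meaningful, ratios are not,
# because the zero point is arbitrary. 20 C is 10 degrees warmer than 10 C,
# but not "twice as hot": the ratio changes when the scale changes.
celsius_difference = 20 - 10
celsius_ratio = 20 / 10
kelvin_ratio = (20 + 273.15) / (10 + 273.15)

# Quantitative (ratio-scale): a true zero exists, so ratios are meaningful.
price_ratio = 20.00 / 10.00  # the first price really is twice the second

print(country_counts.most_common(1))
print(medals_by_rank)
print(medals_alphabetical)
print(celsius_difference, celsius_ratio, round(kelvin_ratio, 3))
print(price_ratio)
```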
As well as capturing the types of data we have, it is a useful exercise to make
a note of the range of values, or at least a sample of the data, held against each
field. For illustration, a profile of a dataset about the Olympics might look like this:
Data | Types | Range
Event | Quantitative (interval-scale) | 27 different years (1896-2012)
Medal | Categorical ordinal | Gold, silver, bronze
Athlete | Categorical nominal | 1500+ different athlete names
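A profile like the one above can be generated rather than written by hand. The sketch below is one possible approach in Python, not a prescribed method: the data type of each field is declared by the analyst (it cannot be reliably inferred from the values alone), and the range is then summarized from the data. The sample rows, FIELD_TYPES, describe_range, and profile are all hypothetical names used for illustration.

```python
# Hypothetical sample rows; a real dataset would be far larger.
rows = [
    {"Event": 1896, "Medal": "Gold", "Athlete": "A. Example"},
    {"Event": 1900, "Medal": "Silver", "Athlete": "B. Example"},
    {"Event": 2012, "Medal": "Bronze", "Athlete": "C. Example"},
    {"Event": 2012, "Medal": "Gold", "Athlete": "A. Example"},
]

# Data types assigned by the analyst after examining each field.
FIELD_TYPES = {
    "Event": "Quantitative (interval-scale)",
    "Medal": "Categorical ordinal",
    "Athlete": "Categorical nominal",
}


def describe_range(values, data_type, max_listed=5):
    """Summarize the values held against one field."""
    distinct = sorted(set(values))
    if data_type.startswith("Quantitative"):
        # For quantitative fields, report the count and the min-max span.
        return f"{len(distinct)} distinct values ({distinct[0]}-{distinct[-1]})"
    if len(distinct) <= max_listed:
        # For small categorical fields, list every category.
        return ", ".join(str(value) for value in distinct)
    # For large categorical fields, report only the count.
    return f"{len(distinct)} distinct values"


def profile(rows, field_types):
    """Return {field: (data type, range summary)} for each declared field."""
    return {
        field: (data_type, describe_range([row[field] for row in rows], data_type))
        for field, data_type in field_types.items()
    }


for field, (data_type, value_range) in profile(rows, FIELD_TYPES).items():
    print(f"{field} | {data_type} | {value_range}")
```

Keeping the declared types in one place makes it easy to revisit them later, for example if a field first treated as nominal turns out to have a natural order.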