numeric data may have text entries such as N/A or #ERROR. These
entries should be cleaned or removed.
Units: Data sets can sometimes shift units, as in the final e-mail,
whose size is given in megabytes (MB) whereas all earlier examples
were in kilobytes (KB). All numeric data needs to be normalized to
the same units.
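The cleaning and unit normalization described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
sample values in raw_sizes are hypothetical, the helper name size_in_kb is
made up for this example, and 1 MB is taken to be 1024 KB (some data sets use
1000 instead).

```python
import re

# Hypothetical raw size values, mixing units and invalid text entries.
raw_sizes = ["156kb", "25 KB", "N/A", "2MB", "#ERROR", "77kb"]

# Assumption: 1 MB = 1024 KB. Adjust if the data source uses 1000.
KB_PER_UNIT = {"kb": 1, "mb": 1024}

# A number, optional whitespace, then a unit of KB or MB (any letter case).
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kb|mb)\s*$", re.IGNORECASE)


def size_in_kb(text):
    """Return the size in kilobytes, or None if the entry is not a valid size."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        return None  # Entries such as N/A or #ERROR end up here.
    value, unit = match.groups()
    return float(value) * KB_PER_UNIT[unit.lower()]


# Normalize every entry, then keep only the valid ones.
cleaned = [size_in_kb(s) for s in raw_sizes]
valid = [s for s in cleaned if s is not None]
```

Returning None for invalid entries, rather than dropping them immediately,
keeps the cleaned list aligned with the original records, so the decision to
remove or repair a record can be made separately.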
There are many approaches to dealing with invalid, incomplete, and
inconsistent data. A simple approach is to remove the problematic record;
other approaches include imputing missing values or normalizing the data.
These are beyond the scope of this topic.
Depending on the data set, privacy issues may need to be addressed, for
example, where people are uniquely identified by name or number (such as
a government ID number). In an e-mail data set, the names of individuals
should be replaced with numbers, letters, or generic names. Unique, generic
names can be found in government registries (for example,
www.ssa.gov/OACT/babynames/limits.html). Corporate policy varies between
companies, so check the appropriate guidelines. If you are uncertain,
replacing personally identifying information (such as names) with other
data is a good idea.
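One way to replace names consistently is to build a lookup table so that the
same person always receives the same generic label. The following is a
minimal sketch, not a complete anonymization method: the function name
pseudonymize, the sample names, and the "Person A" style labels are all
hypothetical.

```python
def pseudonymize(names, replacements):
    """Map each distinct name to a generic replacement, consistently."""
    mapping = {}
    pool = iter(replacements)
    for name in names:
        # Skip empty entries and names that already have a replacement.
        if name and name not in mapping:
            # Raises StopIteration if there are too few replacements.
            mapping[name] = next(pool)
    return mapping


# Hypothetical identifying names as they might appear in raw e-mail data.
real_names = ["Benjamin Smith", "Zoe Jones", "Benjamin Smith", "Timothy Lee"]
generic_labels = ["Person A", "Person B", "Person C", "Person D"]

mapping = pseudonymize(real_names, generic_labels)
anonymized = [mapping[name] for name in real_names]
```

Because the mapping itself links real names to labels, it should be stored
separately from the anonymized data or discarded, depending on the
applicable guidelines.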
Connect: Organize Graph Data
By definition, a graph is a collection of nodes and links between the nodes.
Graph software almost always works with a data set of nodes and a data set
of links. Even if not required, it can be very effective, conceptually, to
identify and organize data into a set of nodes and a set of links. A clear
separation like this enables data exploration with a wider variety of tools.
Extending the e-mail example, the clean data may look like this:
To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Zoe", 11/18/2014, 77kb
"Ben", "Ann", "", 10/31/2014, 2048kb
...
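As a rough sketch of how such clean data could be split into a set of nodes
and a set of links, the Python below reads the four sample rows shown above
(the elided rows are left out). The modeling choices are assumptions made
for illustration: each person becomes a node, each e-mail becomes a directed
link from sender to recipient, and a CC recipient gets an additional link
from the sender.

```python
import csv
import io

# The four sample rows from the text; the elided rows are not included.
CLEAN_DATA = '''To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Zoe", 11/18/2014, 77kb
"Ben", "Ann", "", 10/31/2014, 2048kb
'''

# skipinitialspace=True ignores the blank after each comma.
reader = csv.DictReader(io.StringIO(CLEAN_DATA), skipinitialspace=True)

node_set = set()
links = []
for row in reader:
    sender, recipient, cc = row["From"], row["To"], row["CC"]
    # Every non-empty name in the row is a node.
    node_set.update(name for name in (sender, recipient, cc) if name)
    # Assumption: one directed link per e-mail, from sender to recipient.
    links.append({"source": sender, "target": recipient,
                  "date": row["Date"], "size": row["Size"]})
    # Assumption: a CC recipient is modeled as an extra link from the sender.
    if cc:
        links.append({"source": sender, "target": cc,
                      "date": row["Date"], "size": row["Size"]})

nodes = sorted(node_set)
```

With the nodes and links held in separate collections, each can be written
to its own file in whatever format a given graph tool expects.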