to operate on messy data, and it is your task to first clean and prepare data
before providing it to graph software.
Carrying on with the senior sales rep example outlined previously, let's use a
data set of 10,000 e-mails as an illustration. Each person who sent, received,
or was Cc'd will be a node. Links will be formed between any pair of people
included in the same e-mail.
Because the actual messages are not required, only the metadata is
exported: To, From, Cc, Bcc, Date, e-mail size, and so on. The exported data
set ideally will look like this, with each row indicating one e-mail message
between a group of people:
To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Zoe", 11/18/2014, 77kb
"Ben", "Ann", "", 10/31/2014, 2048kb
...
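As a rough illustration of how rows like these could become nodes and links, here is a minimal Python sketch. It assumes the export is a CSV with exactly the header shown above; the function name `build_graph` and the idea of counting how many e-mails each pair shares are assumptions added for illustration, not something the text prescribes.

```python
import csv
import io
from itertools import combinations

# The clean sample from the text, embedded so the sketch is self-contained.
SAMPLE = '''To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Zoe", 11/18/2014, 77kb
"Ben", "Ann", "", 10/31/2014, 2048kb
'''

def build_graph(text):
    """Return (nodes, links) for a CSV export of e-mail metadata.

    nodes is a set of names; links maps an alphabetically ordered
    pair of names to the number of e-mails that included both people
    (the count is an illustrative assumption).
    """
    # skipinitialspace handles the blank that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    nodes = set()
    links = {}
    for row in reader:
        # Everyone named on the e-mail, ignoring empty fields.
        people = {row[field].strip() for field in ("To", "From", "CC")}
        people.discard("")
        nodes.update(people)
        # One link for every pair of people on the same e-mail.
        for pair in combinations(sorted(people), 2):
            links[pair] = links.get(pair, 0) + 1
    return nodes, links

nodes, links = build_graph(SAMPLE)
print(sorted(nodes))
print(links)
```

Sorting each pair before storing it means that Ben-Zoe and Zoe-Ben are treated as the same undirected link.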
Unfortunately, real data is rarely as tidy and error-free as the data shown
here. A real-world e-mail data file may look more like this (with various
anomalies shown underlined):
To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe Jones", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Tim; Zoe", 11/09/2014, 77kb
"Ben", "Ann", 76.3, n/a, 2048kb
"Ben", "", "", 01/01/2014, 4.2Mb
...
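Some of the problems in a file like this can be flagged mechanically before any graph is built. The following Python sketch is only one possible set of checks, under stated assumptions: dates are expected in MM/DD/YYYY form, a purely numeric Cc value is treated as an error, and a semicolon in the Cc field is taken to mean several names were packed into one field. The function name `find_problems` and the problem labels are hypothetical. Inconsistent spellings of the same person, such as "Zoe" and "Zoe Jones", are not detected by checks this simple.

```python
import csv
import io
from datetime import datetime

# The dirty sample from the text, embedded so the sketch is self-contained.
DIRTY = '''To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe Jones", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Tim; Zoe", 11/09/2014, 77kb
"Ben", "Ann", 76.3, n/a, 2048kb
"Ben", "", "", 01/01/2014, 4.2Mb
'''

def looks_numeric(value):
    """True for strings such as '76.3' or '42'; False for names or ''."""
    return value.replace(".", "", 1).isdigit()

def find_problems(text):
    """Return a list of (file_line_number, description) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        cc = row["CC"].strip()
        if not row["From"].strip():
            problems.append((line_no, "missing sender"))
        if looks_numeric(cc):
            problems.append((line_no, "numeric CC value"))
        if ";" in cc:
            problems.append((line_no, "multiple names in one CC field"))
        try:
            # Assumed date format: month/day/four-digit year.
            datetime.strptime(row["Date"].strip(), "%m/%d/%Y")
        except ValueError:
            problems.append((line_no, "unparseable date"))
    return problems

for line_no, description in find_problems(DIRTY):
    print(f"line {line_no}: {description}")
```

A report like this only locates suspect rows; deciding how to repair each one still requires judgment about the data.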
In this example of dirty data, many data-quality issues must be addressed
before constructing the graph data:
Inconsistent node names: Nodes are not consistently named. In
this example, both “Zoe” and “Zoe Jones” refer to the same person. In
real-world data, this can get quite messy. For example, in one e-mail
data set you may find that “John Doe” also appeared as
“john.doe@bigco.com” or “Doe, John,” with or without surrounding
quotes, with prefixes (for example, “SMTP: john.doe@bigco.com”) and/