Data—Collect, Clean, and Connect - Graph Analysis and Visualization

to operate on messy data, and it is your task to first clean and prepare data

before providing it to graph software.

Carrying on with the senior sales rep example outlined previously, let's use a

data set of 10,000 e-mails as an illustration. Each person who sent, received,

or was Cc'd will be a node. Links will be formed between any pair of people

included in the same e-mail.
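
The node-and-link rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name email_links, the in-memory list of participants, and the idea of counting how many e-mails each pair shares (a link weight) are not from the text, which only says that a link is formed between any pair of people on the same e-mail.

```python
from collections import Counter
from itertools import combinations


def email_links(emails):
    """Return the set of people (nodes) and a count of shared e-mails per pair (links).

    `emails` is an iterable of participant lists, one list per e-mail,
    containing everyone who sent, received, or was Cc'd on that message.
    """
    nodes = set()
    links = Counter()
    for people in emails:
        # Drop blanks and duplicates; sorting makes (a, b) and (b, a) the same key.
        participants = sorted({p for p in people if p})
        nodes.update(participants)
        # Every pair of people on the same e-mail gets one link (or one more count).
        for pair in combinations(participants, 2):
            links[pair] += 1
    return nodes, links


# Hypothetical participants for four e-mails (sender first, then recipients).
emails = [
    ["Zoe", "Ben"],
    ["Zoe", "Ben", "Tim"],
    ["Tim", "Ben", "Zoe"],
    ["Ann", "Ben"],
]
nodes, links = email_links(emails)
for pair, count in sorted(links.items()):
    print(pair, count)
```

Keeping a count per pair rather than a plain set of pairs is a design choice: it preserves how often two people corresponded, which graph software can usually map to link weight.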

Because the actual messages are not required, only the metadata is

exported: To, From, Cc, Bcc, Date, e-mail size, and so on. The exported data

set ideally will look like this, with each row indicating one e-mail message

between a group of people:

To, From, CC, Date, Size

"Ben", "Zoe", "", 12/09/2014, 156kb

"Ben", "Zoe", "Tim", 02/02/2014, 25kb

"Ben", "Tim", "Zoe", 11/18/2014, 77kb

"Ben", "Ann", "", 10/31/2014, 2048kb

...

Unfortunately, real data is rarely as tidy and error-free as the data shown

here. A real-world e-mail data file may look more like this (with various

anomalies):

To, From, CC, Date, Size

"Ben", "Zoe", "", 12/09/2014, 156kb

"Ben", "Zoe Jones", "Tim", 02/02/2014, 25kb

"Ben", "Tim", "Tim; Zoe", 11/09/2014, 77kb

"Ben", "Ann", 76.3, n/a, 2048kb

"Ben", "", "", 01/01/2014, 4.2Mb

...
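
A rough way to spot rows like these before building graph data is to run a few per-row checks. The sketch below uses Python's standard csv module. Which values count as anomalies is an assumption based on reading the sample (an empty sender, a number where a name belongs, several names in one field, a date that does not parse, a size not given in whole kb), and the helper name row_problems is hypothetical. Row-level checks like these cannot tell that "Zoe" and "Zoe Jones" are the same person; that needs a separate name-matching step.

```python
import csv
import io
import re
from datetime import datetime

# The dirty sample rows, embedded as text so the sketch is self-contained.
RAW = '''To, From, CC, Date, Size
"Ben", "Zoe", "", 12/09/2014, 156kb
"Ben", "Zoe Jones", "Tim", 02/02/2014, 25kb
"Ben", "Tim", "Tim; Zoe", 11/09/2014, 77kb
"Ben", "Ann", 76.3, n/a, 2048kb
"Ben", "", "", 01/01/2014, 4.2Mb
'''


def row_problems(row):
    """Return a list of human-readable problems found in one e-mail row."""
    problems = []
    if not row["From"]:
        problems.append("missing sender")
    for field in ("To", "From", "CC"):
        # A bare number in a name field is almost certainly a misplaced value.
        if re.fullmatch(r"\d+(\.\d+)?", row[field]):
            problems.append(f"{field} is a number, not a name")
        # A semicolon suggests several names packed into one field.
        if ";" in row[field]:
            problems.append(f"{field} holds several names")
    try:
        datetime.strptime(row["Date"], "%m/%d/%Y")
    except ValueError:
        problems.append("unparseable date")
    # Assumption: sizes are expected as whole kilobytes, e.g. "156kb".
    if not re.fullmatch(r"\d+kb", row["Size"]):
        problems.append("size not in whole kb")
    return problems


# skipinitialspace=True ignores the blank after each comma in the export.
reader = csv.DictReader(io.StringIO(RAW), skipinitialspace=True)
report = {n: row_problems(row) for n, row in enumerate(reader, start=1)}
for n, problems in report.items():
    print(n, problems or "ok")
```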

In this example of dirty data, many data-quality issues must be addressed

before constructing the graph data:

• Inconsistent node names: Nodes are not consistently named. In

this example, both “Zoe” and “Zoe Jones” refer to the same person. In

real-world data, this can get quite messy. For example, in one e-mail

data set you may find that “John Doe” also appeared as

“john.doe@bigco.com” or “Doe, John,” with or without surrounding

quotes, with prefixes (for example, “SMTP: john.doe@bigco.com”) and/
