Data—Collect, Clean, and Connect - Graph Analysis and Visualization

or suffixes (for example, “John Doe (Email)”). These need to be

consolidated into a single record.

• Duplicate nodes —Within the node data set, each node should appear

only once. For example, “Zoe Jones” should occur only one time. If

multiple “Zoe Jones” occur in the data and all refer to the same Zoe

Jones, these should be aggregated into a single record. If two different

people named Zoe Jones are employed, then each node should be identified

with a unique identifier (for example, an e-mail address or employee number).

• Duplicate links —Some types of graph visualization and analysis

software do not work well with many links between the same pair of

nodes, and these must be consolidated. It is quite common to have

many links in the data between the same pair of nodes based on

additional attributes. For example, in the Flight_Stats data set

provided in the Supplementary Material on this topic's companion

website, there may be multiple flights on a given day between a pair of

cities at different times, on different airlines. If the objective is to

understand the number of flights between each city pair, these must be

consolidated down to a single link for that city pair. Alternatively, if the

objective is to analyze each of the different carrier networks, the

different links must be maintained, and the analysis tools chosen must

handle multiple links between points.

• Self-loop —A node that has a link that connects to itself is a self-loop .

In the third e-mail of the previous example, Tim has sent an e-mail to

Ben and Zoe, but also Cc'd himself, thus creating a self-loop. Self-loops

may not be relevant to the analytic objectives, and some graph software

does not handle them.

• Isolated nodes —In the final e-mail shown previously, no From or Cc

is identified. It is possible for a data set to contain nodes to which no

links exist, and some graph programs have problems with such unlinked

nodes.

• Links pointing to nonexistent nodes —Although this does not

occur in the previous example, in some data sets, a link may be defined

between two nodes, where one of the nodes does not exist in the list of

nodes. This may cause problems with some graph software.
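
The checks described so far (duplicate nodes, duplicate links, self-loops, isolated nodes, and links pointing to nonexistent nodes) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a method prescribed by the text. The sample addresses, the function name clean_graph, and the decision to drop self-loops and dangling links rather than keep them are all assumptions made for the example; whether to drop or keep them depends on the analytic objective.

```python
from collections import Counter

# Hypothetical sample data (not from the source): node IDs and directed links.
nodes = ["zoe@example.com", "tim@example.com", "ben@example.com",
         "zoe@example.com", "amy@example.com"]
links = [
    ("tim@example.com", "ben@example.com"),
    ("tim@example.com", "zoe@example.com"),
    ("tim@example.com", "tim@example.com"),   # self-loop
    ("tim@example.com", "ben@example.com"),   # duplicate link
    ("ben@example.com", "joe@example.com"),   # joe is not in the node list
]


def clean_graph(nodes, links):
    """Return (unique_nodes, weighted_links, report) for a node/link list."""
    # Duplicate nodes: keep the first occurrence of each identifier, in order.
    unique_nodes = list(dict.fromkeys(nodes))
    known = set(unique_nodes)

    # Self-loops: links whose source and target are the same node.
    self_loops = [(s, t) for s, t in links if s == t]

    # Links pointing to nonexistent nodes: an endpoint is missing from the node list.
    dangling = [(s, t) for s, t in links if s not in known or t not in known]

    # Duplicate links: collapse repeats into one link carrying a count (weight).
    # Assumption: self-loops and dangling links are dropped before counting.
    kept = [(s, t) for s, t in links if s != t and s in known and t in known]
    weighted_links = Counter(kept)

    # Isolated nodes: nodes that no remaining link touches.
    linked = {n for pair in weighted_links for n in pair}
    isolated = [n for n in unique_nodes if n not in linked]

    report = {"self_loops": self_loops, "dangling": dangling, "isolated": isolated}
    return unique_nodes, weighted_links, report


unique_nodes, weighted_links, report = clean_graph(nodes, links)
print(unique_nodes)
print(dict(weighted_links))
print(report)
```

Collapsing repeated links into a count corresponds to the "number of flights between each city pair" case above. If each carrier's network must be analyzed separately, the individual links would be kept instead, for example by adding the carrier to the key.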

• Invalid data —Unfortunately, real-world data consists of fields that

may be empty, NULL, or may otherwise have invalid data. A column of
