Data Analytics Lifecycle - Data Science and Big Data Analytics

As with the previous example of deciding which data to keep as it relates to fraud

detection on credit card usage, it is critical to be thoughtful about which data the

team chooses to keep and which data it discards. Discarding too much of the data

too early in this process can have far-reaching consequences, forcing the team to

retrace previous steps. Typically, data science teams would rather keep too much

data than too little for the analysis. Additional questions and considerations for

the data conditioning step include the following:

• What are the data sources? What are the target fields (for example,

columns of the tables)?

• How clean is the data?

• How consistent are the contents and files? Determine to what degree the

data contains missing or inconsistent values and whether it contains

values that deviate from normal.

• Assess the consistency of the data types. For instance, if the team expects

certain data to be numeric, confirm whether it is numeric or a mixture of

alphanumeric strings and text.

• Review the content of data columns or other inputs, and check to ensure

they make sense. For instance, if the project involves analyzing income

levels, preview the data to confirm that the income values are positive, or

determine whether zeros or negative values are acceptable.

• Look for any evidence of systematic error. Examples include data feeds

from sensors or other data sources breaking without anyone noticing,

which causes invalid, incorrect, or missing data values. In addition, review

the data to gauge whether the definition of the data is the same across all

measurements. In some cases, a data column is repurposed, or the column

stops being populated, without this change being annotated or without

others being notified.
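
Several of the checks above (missing values, data type consistency, and whether zeros or negative values appear) can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library, not a method prescribed by the text. The function name profile_column, the set of missing-value markers, and the sample income values are all hypothetical and chosen only for demonstration.

```python
import math

# Assumed markers for "missing"; adjust these to match the actual data source.
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def profile_column(values):
    """Count missing, numeric, non-numeric, negative, and zero entries in one column."""
    counts = {"missing": 0, "numeric": 0, "non_numeric": 0, "negative": 0, "zero": 0}
    for value in values:
        # Treat None and known marker strings as missing values.
        if value is None or (isinstance(value, str) and value.strip() in MISSING_MARKERS):
            counts["missing"] += 1
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            # The value is present but cannot be read as a number.
            counts["non_numeric"] += 1
            continue
        if math.isnan(number):
            # NaN parses as a float but carries no usable value.
            counts["missing"] += 1
            continue
        counts["numeric"] += 1
        if number < 0:
            counts["negative"] += 1
        elif number == 0:
            counts["zero"] += 1
    return counts


# Hypothetical income column mixing strings, numbers, and dirty entries.
sample_income = ["52000", "61000.50", "", "N/A", "-1200", "0", "unknown", 48000]
summary = profile_column(sample_income)
print(summary)
```

A summary like this does not decide whether a negative or zero income is valid. It only surfaces the counts so the team can judge, in the context of the project, which values are acceptable and which point to dirty data.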

2.3.5 Survey and Visualize

After the team has collected and obtained at least some of the datasets needed

for the subsequent analysis, a useful step is to leverage data visualization tools to

gain an overview of the data. Seeing high-level patterns in the data enables one

to understand characteristics of the data very quickly. One example is using

data visualization to examine data quality, such as whether the data contains many

unexpected values or other indicators of dirty data. (Dirty data will be discussed
