As with the previous example of deciding which data to keep for fraud
detection on credit card usage, it is critical to be thoughtful about which data the
team chooses to keep and which data it discards. If the team discards too much of
the data too early in the process, it may have to retrace previous steps. Typically,
data science teams would rather keep too much data than too little for the analysis.
Additional questions and considerations for the data conditioning step include the
following:
• What are the data sources? What are the target fields (for example,
columns of the tables)?
• How clean is the data?
• How consistent are the contents and files? Determine to what degree the
data contains missing or inconsistent values and whether it contains
values that deviate from the normal range.
• Assess the consistency of the data types. For instance, if the team expects
certain data to be numeric, confirm whether it is numeric or a mixture of
alphanumeric strings and text.
• Review the content of data columns or other inputs, and check to ensure
they make sense. For instance, if the project involves analyzing income
levels, preview the data to confirm that the income values are positive, or
determine whether zeros or negative values are acceptable.
• Look for any evidence of systematic error. Examples include data feeds
from sensors or other data sources breaking without anyone noticing,
which causes invalid, incorrect, or missing data values. In addition, review
the data to gauge if the definition of the data is the same over all
measurements. In some cases, a data column is repurposed, or the column
stops being populated, without this change being annotated or without
others being notified.
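Several of the checks above can be sketched in a few lines of Python. The following is a minimal illustration, not a prescribed method: the function names, the set of strings treated as missing, and the sample income values are all assumptions made for this example. It counts missing entries, entries that are not numeric, and zero or negative values in a column, and it reports the longest run of consecutive missing values, which may hint at a feed that stopped being populated.

```python
# Assumed markers for missing values; adjust to match the actual data source.
MISSING = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """Treat None and the assumed marker strings as missing."""
    return value is None or str(value).strip().lower() in MISSING


def is_numeric(value):
    """Return True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def profile_column(values):
    """Summarize missing, non-numeric, negative, and zero entries in a column."""
    present = [v for v in values if not is_missing(v)]
    numeric = [float(v) for v in present if is_numeric(v)]
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "non_numeric": len(present) - len(numeric),
        "negative": sum(1 for x in numeric if x < 0),
        "zero": sum(1 for x in numeric if x == 0),
    }


def longest_missing_run(values):
    """Length of the longest run of consecutive missing values."""
    longest = current = 0
    for v in values:
        current = current + 1 if is_missing(v) else 0
        longest = max(longest, current)
    return longest


# Hypothetical income column, as it might look when read from a raw text file.
income = ["52000", "48000.50", "", "N/A", "-1200", "0", "sixty thousand", "71000"]
report = profile_column(income)
print(report)
print(longest_missing_run(income))
```

A report like this does not decide whether a negative or zero income is an error; it only surfaces the cases so the team can judge whether such values are acceptable for the project.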
2.3.5 Survey and Visualize
After the team has collected and obtained at least some of the datasets needed
for the subsequent analysis, a useful step is to leverage data visualization tools to
gain an overview of the data. Seeing high-level patterns in the data enables one
to understand the characteristics of the data very quickly. One example is using
data visualization to examine data quality, such as whether the data contains many
unexpected values or other indicators of dirty data. (Dirty data will be discussed