raw condition. The reason for this approach is that there is significant value in
preserving the raw data and including it in the sandbox before any transformations
take place.
For instance, consider an analysis for fraud detection on credit card usage. Outliers in this data population often represent higher-risk transactions that may be indicative of fraudulent credit card activity. With ETL, these outliers may be inadvertently filtered out, or transformed and cleaned, before being loaded into the datastore. In that case, the very data needed to evaluate instances of fraudulent activity would be cleansed away, preventing the kind of analysis the team wants to perform.
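To make this concrete, the short Python sketch below uses hypothetical transaction data and an illustrative cleaning threshold (neither drawn from the text) to show how a typical ETL cleaning rule can discard exactly the high-value outliers a fraud analysis would need:

import pandas as pd

# Hypothetical transactions; the values and threshold are illustrative only.
transactions = pd.DataFrame({
    "txn_id": [1, 2, 3, 4, 5],
    "amount": [25.00, 40.00, 9800.00, 32.00, 15000.00],
})

# A typical ETL cleaning rule: drop "anomalous" amounts before loading.
cleaned = transactions[transactions["amount"] < 5000]

# The rows removed by the cleaning rule are exactly the high-value
# outliers a fraud analysis would want to inspect.
dropped = transactions[transactions["amount"] >= 5000]
print(dropped)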
Following the ELT approach gives the team access to clean data to analyze after the data has been loaded into the database, as well as access to the data in its original form for finding hidden nuances. This approach is part of the reason the analytic sandbox can grow large quickly: the team may want clean data and aggregated data, and may also need to keep a copy of the original data to compare against or to mine for patterns that existed before the cleaning stage. This process can be summarized as ETLT, reflecting the fact that a team may choose to perform ETL in one case and ELT in another.
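As one possible illustration of ETLT in a sandbox, the following sketch uses SQLite and pandas as stand-ins for whatever datastore the team actually uses. It loads the raw data first, then derives a cleaned table inside the database, so both versions remain available:

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical raw extract, including a record with a missing amount.
raw = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "amount": [25.00, None, 9800.00],
})

# ELT: load the raw data first, preserving it untouched in the sandbox.
raw.to_sql("transactions_raw", conn, index=False)

# Then transform inside the database, keeping the raw table alongside
# the cleaned one so both remain available for analysis.
conn.execute("""
    CREATE TABLE transactions_clean AS
    SELECT txn_id, amount
    FROM transactions_raw
    WHERE amount IS NOT NULL
""")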
Depending on the size and number of the data sources, the team may need to consider how to parallelize the movement of the datasets into the sandbox. In this context, moving large amounts of data is sometimes referred to as Big ETL. The data movement can be parallelized by technologies such as Hadoop or MapReduce, which are explained in greater detail in Chapter 10, “Advanced Analytics—Technology and Tools: MapReduce and Hadoop.” At this point, keep in mind that these technologies can be used to perform parallel data ingest, introducing a large number of files or datasets in a short period of time. Hadoop can be useful for data loading as well as for data analysis in subsequent phases.
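Hadoop itself is covered in Chapter 10. As a small single-machine analogy for the idea of parallel ingest, the sketch below (with hypothetical file paths) copies many source files into a sandbox landing zone concurrently:

from concurrent.futures import ThreadPoolExecutor
import pathlib
import shutil

def ingest(path, landing_dir="/tmp/sandbox/landing"):
    # Copy one source file into the sandbox landing zone.
    dest = pathlib.Path(landing_dir) / pathlib.Path(path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, dest)
    return dest

# Hypothetical source directory of exported files.
source_files = pathlib.Path("/data/exports").glob("*.csv")

# Ingest the files in parallel rather than one at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ingest, source_files))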
Prior to moving the data into the analytic sandbox, determine the transformations
that need to be performed on the data. Part of this phase involves assessing data
quality and structuring the datasets properly so they can be used for robust
analysis in subsequent phases. In addition, it is important to consider which data the team will have access to and which new data attributes will need to be derived to enable analysis.
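A minimal sketch of such an assessment, assuming a hypothetical extract with customer_id and amount columns, might look like the following; the derived z-score attribute is an illustrative example, not one prescribed by the text:

import pandas as pd

df = pd.read_csv("transactions_raw.csv")  # hypothetical extract

# Basic data-quality assessment: column types and missing values.
print(df.dtypes)
print(df.isna().sum())

# Derive a new attribute needed for the analysis, e.g. each transaction
# amount expressed as a z-score within that customer's history.
df["amount_zscore"] = (
    df.groupby("customer_id")["amount"]
      .transform(lambda s: (s - s.mean()) / s.std())
)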
As part of the ETLT step, it is advisable to make an inventory of the data and compare what is currently available with the datasets the team needs. Performing this kind of gap analysis highlights which datasets are missing and where the team may need to source them.
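Such an inventory comparison can be as simple as a set difference; the dataset names below are hypothetical placeholders:

needed = {"transactions", "customers", "merchant_profiles", "chargebacks"}
available = {"transactions", "customers"}

missing = needed - available
print("Datasets still to be sourced:", sorted(missing))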