raw condition. The reason for this approach is that there is significant value in
preserving the raw data and including it in the sandbox before any transformations
take place.
For instance, consider an analysis for fraud detection on credit card usage. Outliers in this data population often represent higher-risk transactions that may be indicative of fraudulent credit card activity. With ETL, these outliers may be inadvertently filtered out, or transformed and cleaned, before being loaded into the datastore. In that case, the very data needed to evaluate instances of fraudulent activity would be cleansed away, preventing the kind of analysis the team wants to perform.
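To make this concrete, the short Python sketch below uses hypothetical transaction data and an illustrative cleaning threshold (neither drawn from the text) to show how a typical ETL cleaning rule can discard exactly the high-value outliers a fraud analysis would need:

import pandas as pd

# Hypothetical transactions; the values and threshold are illustrative only.
transactions = pd.DataFrame({
    "txn_id": [1, 2, 3, 4, 5],
    "amount": [25.00, 40.00, 9800.00, 32.00, 15000.00],
})

# A typical ETL cleaning rule: drop "anomalous" amounts before loading.
cleaned = transactions[transactions["amount"] < 5000]

# The rows removed by the cleaning rule are exactly the high-value
# outliers a fraud analysis would want to inspect.
dropped = transactions[transactions["amount"] >= 5000]
print(dropped)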
Following the ELT approach gives the team access to clean data to analyze after the data has been loaded into the database, as well as access to the data in its original form for finding hidden nuances. This approach is part of the reason the analytic sandbox can grow large quickly: the team may want clean data and aggregated data, and may also need to keep a copy of the original data to compare against or to mine for patterns that existed before the cleaning stage. This process can be summarized as ETLT, reflecting the fact that a team may choose to perform ETL in one case and ELT in another.
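As one possible illustration of ETLT in a sandbox, the following sketch uses SQLite and pandas as stand-ins for whatever datastore the team actually uses. It loads the raw data first, then derives a cleaned table inside the database, so both versions remain available:

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical raw extract, including a record with a missing amount.
raw = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "amount": [25.00, None, 9800.00],
})

# ELT: load the raw data first, preserving it untouched in the sandbox.
raw.to_sql("transactions_raw", conn, index=False)

# Then transform inside the database, keeping the raw table alongside
# the cleaned one so both remain available for analysis.
conn.execute("""
    CREATE TABLE transactions_clean AS
    SELECT txn_id, amount
    FROM transactions_raw
    WHERE amount IS NOT NULL
""")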
Depending on the size and number of the data sources, the team may need to consider how to parallelize the movement of the datasets into the sandbox. In this context, moving large amounts of data is sometimes referred to as Big ETL. The data movement can be parallelized by technologies such as Hadoop or MapReduce, which are explained in greater detail in Chapter 10, “Advanced Analytics—Technology and Tools: MapReduce and Hadoop.” At this point, keep in mind that these technologies can be used to perform parallel data ingest, introducing a large number of files or datasets in a short period of time. Hadoop can be useful for data loading as well as for data analysis in subsequent phases.
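Hadoop itself is covered in Chapter 10. As a small single-machine analogy for the idea of parallel ingest, the sketch below (with hypothetical file paths) copies many source files into a sandbox landing zone concurrently:

from concurrent.futures import ThreadPoolExecutor
import pathlib
import shutil

def ingest(path, landing_dir="/tmp/sandbox/landing"):
    # Copy one source file into the sandbox landing zone.
    dest = pathlib.Path(landing_dir) / pathlib.Path(path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, dest)
    return dest

# Hypothetical source directory of exported files.
source_files = pathlib.Path("/data/exports").glob("*.csv")

# Ingest the files in parallel rather than one at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ingest, source_files))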
Prior to moving the data into the analytic sandbox, determine the transformations
that need to be performed on the data. Part of this phase involves assessing data
quality and structuring the datasets properly so they can be used for robust
analysis in subsequent phases. In addition, it is important to consider which data the team will have access to and which new data attributes will need to be derived to enable analysis.
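A minimal sketch of such an assessment, assuming a hypothetical extract with customer_id and amount columns, might look like the following; the derived z-score attribute is an illustrative example, not one prescribed by the text:

import pandas as pd

df = pd.read_csv("transactions_raw.csv")  # hypothetical extract

# Basic data-quality assessment: column types and missing values.
print(df.dtypes)
print(df.isna().sum())

# Derive a new attribute needed for the analysis, e.g. each transaction
# amount expressed as a z-score within that customer's history.
df["amount_zscore"] = (
    df.groupby("customer_id")["amount"]
      .transform(lambda s: (s - s.mean()) / s.std())
)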
As part of the ETLT step, it is advisable to make an inventory of the data and compare what is currently available with the datasets the team needs. Performing this kind of gap analysis highlights which datasets are missing and where the team may need to source them.
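Such an inventory comparison can be as simple as a set difference; the dataset names below are hypothetical placeholders:

needed = {"transactions", "customers", "merchant_profiles", "chargebacks"}
available = {"transactions", "customers"}

missing = needed - available
print("Datasets still to be sourced:", sorted(missing))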