Where most of the data to be analyzed is structured, Oracle Database 12c pattern matching capabilities might suffice for processing the semi-structured data.
Where huge data volumes of semi-structured or unstructured data are being gathered,
separate Hadoop clusters are used to filter data and MapReduce results are often loaded
into an Oracle data warehouse. Oracle offers an engineered system for deploying Hadoop clusters called the Oracle Big Data Appliance (BDA). When this edition of Oracle Essentials was published, a Full Rack consisted of 18 nodes, 648 terabytes of disk, 1,152 GB of memory, 288 processing cores, and an InfiniBand interconnect. Starter Racks
populated with 6 nodes were also available. Multiple Full Racks can be connected via
InfiniBand using internal switches in the Racks. The foundation software for the BDA
includes the Cloudera Distribution of Hadoop (CDH), Cloudera Manager, Oracle
NoSQL Database Community Edition, Java VM, and Linux. In a standard configuration,
data is distributed across the entire platform using HDFS and triple replicated. As with
other engineered systems, Oracle provides a single point of support for the entire
configuration.
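To make the first stage of this flow concrete, the following is a minimal sketch of the kind of filtering job a Hadoop cluster might run before results are loaded into the warehouse, written as Hadoop Streaming mapper and reducer scripts in Python. The tab-separated log layout, the field names, and the status-code filter are illustrative assumptions, not part of the BDA software stack.

#!/usr/bin/env python
# mapper.py -- reads raw log lines from stdin, drops malformed or
# uninteresting records, and emits one (page, 1) pair per valid hit.
# Assumed input: tab-separated lines of timestamp, page, status_code.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue                 # filter out malformed records
    timestamp, page, status = fields
    if status != "200":
        continue                 # filter: keep successful requests only
    print("%s\t1" % page)

#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so all counts for a
# given page arrive together; this sums them into one row per page.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page:
        if current_page is not None:
            print("%s\t%d" % (current_page, count))
        current_page, count = page, 0
    count += int(value)
if current_page is not None:
    print("%s\t%d" % (current_page, count))

It is the reducer's compact output, rather than the raw logs, that would then be moved into warehouse tables, for example with Oracle Loader for Hadoop.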
Loading Data into the Data Warehouse
Experienced data warehouse architects realize that the process of understanding the
data sources, designing transformations, testing the loading process, and debugging is
often the most time-consuming part of deployment. Transformations are used to remove bogus data (including erroneous and duplicate entries), convert data items to an agreed-upon format, and filter out data not considered necessary for the warehouse. Together, these operations improve the quality of the data loaded into the warehouse.
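As a simple illustration of these three operations, consider the Python sketch below. It is not Oracle tooling, and the column names, date format, and validation rules are invented for the example.

# Hedged sketch of a cleansing transformation over CSV extracts,
# assuming columns customer_id, order_date (MM/DD/YYYY), and amount.
import csv
from datetime import datetime

def transform(rows):
    seen = set()
    for row in rows:
        key = (row["customer_id"], row["order_date"], row["amount"])
        if key in seen:
            continue                      # remove duplicate entries
        seen.add(key)
        try:
            # convert to an agreed-upon format (ISO 8601 dates)
            parsed = datetime.strptime(row["order_date"], "%m/%d/%Y")
            row["order_date"] = parsed.strftime("%Y-%m-%d")
            amount = float(row["amount"])
        except ValueError:
            continue                      # remove erroneous entries
        if amount <= 0:
            continue                      # filter data not needed downstream
        yield row

with open("orders.csv", newline="") as src:
    for clean_row in transform(csv.DictReader(src)):
        print(clean_row)                  # in practice, write to a load file

Each dropped row corresponds to one of the transformation goals above: duplicates, records that cannot be parsed, and records the warehouse does not need.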
The frequency with which data is extracted from sources and loaded into the data warehouse is largely determined by how current the data must be to support business decisions. Most data extraction and loading takes place on a “batch” basis, and the data transformations introduce a time delay. Early warehouses were often completely refreshed during the loading process, but as data volumes grew, this became impractical. Today, incremental updates to tables are most common. Where a need for near real-time data exists, warehouses can be loaded almost continuously using a trickle feed if the source data is relatively clean, eliminating the need for complex transformations. If real-time feeds are not possible but real-time recommendations are needed, engines such as Oracle Real-Time Decisions are deployed.
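One common way to implement the incremental pattern just described is a watermark-based poll: each cycle extracts only the rows changed since the last successful load, and shortening the polling interval approximates a trickle feed. The sketch below is generic DB-API Python (qmark parameter style, as in sqlite3); the table names, column names, and connections are hypothetical.

# Hedged sketch of a watermark-driven incremental load. 'src' and 'dwh'
# are assumed to be open DB-API connections; updated_at is stored as an
# ISO 8601 text timestamp so string comparison orders correctly.
import time

def incremental_load(src, dwh, interval_seconds=60):
    watermark = "1970-01-01T00:00:00"     # newest change already loaded
    while True:
        cur = src.cursor()
        cur.execute(
            "SELECT id, payload, updated_at FROM source_orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
        if rows:
            dwh.cursor().executemany(
                "INSERT INTO dwh_orders (id, payload, updated_at) "
                "VALUES (?, ?, ?)",
                rows,
            )
            dwh.commit()
            watermark = rows[-1][2]       # advance the watermark
        time.sleep(interval_seconds)      # smaller interval, closer to trickle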
Is Cleanliness Best?
Once the data in the warehouse is “clean,” should this corrected version of the data be propagated back to the originating OLTP systems? This is an important issue in data warehouse implementation. In some cases, a “closed loop” process is implemented in which the cleansed data is fed back to the source systems.