Where most of the data to be analyzed is structured, Oracle Database 12c pattern matching capabilities might suffice for processing the semi-structured data.
Where huge data volumes of semi-structured or unstructured data are being gathered,
separate Hadoop clusters are used to filter data and MapReduce results are often loaded
into an Oracle data warehouse. Oracle offers an engineered system for deploying Hadoop clusters called the Oracle Big Data Appliance (BDA). When this edition of Oracle Essentials was published, a Full Rack consisted of 18 nodes, 648 terabytes of disk, 1,152 GB of memory, 288 processing cores, and an InfiniBand interconnect. Starter Racks
populated with 6 nodes were also available. Multiple Full Racks can be connected via
InfiniBand using internal switches in the Racks. The foundation software for the BDA
includes the Cloudera Distribution of Hadoop (CDH), Cloudera Manager, Oracle
NoSQL Database Community Edition, Java VM, and Linux. In a standard configuration,
data is distributed across the entire platform using HDFS and triple replicated. As with
other engineered systems, Oracle provides a single point of support for the entire
configuration.
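To make the first stage of this flow concrete, the following is a minimal sketch of the kind of filtering job a Hadoop cluster might run before results are loaded into the warehouse, written as Hadoop Streaming mapper and reducer scripts in Python. The tab-separated log layout, the field names, and the status-code filter are illustrative assumptions, not part of the BDA software stack.

#!/usr/bin/env python
# mapper.py -- reads raw log lines from stdin, drops malformed or
# uninteresting records, and emits one (page, 1) pair per valid hit.
# Assumed input: tab-separated lines of timestamp, page, status_code.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue                 # filter out malformed records
    timestamp, page, status = fields
    if status != "200":
        continue                 # filter: keep successful requests only
    print("%s\t1" % page)

#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so all counts for a
# given page arrive together; this sums them into one row per page.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page:
        if current_page is not None:
            print("%s\t%d" % (current_page, count))
        current_page, count = page, 0
    count += int(value)
if current_page is not None:
    print("%s\t%d" % (current_page, count))

It is the reducer's compact output, rather than the raw logs, that would then be moved into warehouse tables, for example with Oracle Loader for Hadoop.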
Loading Data into the Data Warehouse
Experienced data warehouse architects realize that the process of understanding the
data sources, designing transformations, testing the loading process, and debugging is
often the most time-consuming part of deployment. Transformations are used to remove bogus data (including erroneous and duplicate entries), convert data items to an agreed-upon format, and filter out data not considered necessary for the warehouse. Together, these operations improve the quality of the data loaded into the warehouse.
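As a simple illustration of these three operations, consider the Python sketch below. It is not Oracle tooling, and the column names, date format, and validation rules are invented for the example.

# Hedged sketch of a cleansing transformation over CSV extracts,
# assuming columns customer_id, order_date (MM/DD/YYYY), and amount.
import csv
from datetime import datetime

def transform(rows):
    seen = set()
    for row in rows:
        key = (row["customer_id"], row["order_date"], row["amount"])
        if key in seen:
            continue                      # remove duplicate entries
        seen.add(key)
        try:
            # convert to an agreed-upon format (ISO 8601 dates)
            parsed = datetime.strptime(row["order_date"], "%m/%d/%Y")
            row["order_date"] = parsed.strftime("%Y-%m-%d")
            amount = float(row["amount"])
        except ValueError:
            continue                      # remove erroneous entries
        if amount <= 0:
            continue                      # filter data not needed downstream
        yield row

with open("orders.csv", newline="") as src:
    for clean_row in transform(csv.DictReader(src)):
        print(clean_row)                  # in practice, write to a load file

Each dropped row corresponds to one of the transformation goals above: duplicates, records that cannot be parsed, and records the warehouse does not need.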
The frequency with which data is extracted from sources and loaded into the data warehouse is largely determined by how current the data must be to support business decisions. Most data extraction and loading takes place on a “batch” basis, and the data transformations introduce a time delay. Early warehouses were often completely refreshed during the loading process, but as data volumes grew, this became impractical. Today, incremental updates to tables are most common. Where a need for near real-time data exists, warehouses can be loaded almost continuously using a trickle feed if the source data is relatively clean, eliminating the need for complex transformations. If real-time feeds are not possible but real-time recommendations are needed, engines such as Oracle Real-Time Decisions are deployed.
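One common way to implement the incremental pattern just described is a watermark-based poll: each cycle extracts only the rows changed since the last successful load, and shortening the polling interval approximates a trickle feed. The sketch below is generic DB-API Python (qmark parameter style, as in sqlite3); the table names, column names, and connections are hypothetical.

# Hedged sketch of a watermark-driven incremental load. 'src' and 'dwh'
# are assumed to be open DB-API connections; updated_at is stored as an
# ISO 8601 text timestamp so string comparison orders correctly.
import time

def incremental_load(src, dwh, interval_seconds=60):
    watermark = "1970-01-01T00:00:00"     # newest change already loaded
    while True:
        cur = src.cursor()
        cur.execute(
            "SELECT id, payload, updated_at FROM source_orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
        if rows:
            dwh.cursor().executemany(
                "INSERT INTO dwh_orders (id, payload, updated_at) "
                "VALUES (?, ?, ?)",
                rows,
            )
            dwh.commit()
            watermark = rows[-1][2]       # advance the watermark
        time.sleep(interval_seconds)      # smaller interval, closer to trickle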
Is Cleanliness Best?
Once the data in the warehouse is “clean,” should this corrected version of the data be propagated back to the originating OLTP systems? This is an important issue in data warehouse implementation. In some cases, a “closed loop” process is implemented in which the cleansed data is fed back to the source systems.