Data Warehousing Revisited - Data Warehousing in the Age of Big Data

Databases Reference

In-Depth Information

Continuing on understanding how workloads can impact the performance, for a transactional

query the workload is streamlined and can be compared to automobiles on expressways. There is

a smooth flow of traffic and all drivers follow lane management principles. The flow is interrupted

when a breakdown, accident, or large volume of big rigs merge in the traffic. On a similar note, you

will see a slowdown or failure of query processing if there is an infrastructure breakdown or a large

and complex query that might have been generated by an ad-hoc user, which will resemble traffic

jams on any large-city expressway when an accident occurs.

The hidden complexities of data warehouse queries are not discernable by visual inspection. The

reason for this is the design and architecture of the semantic layers built on top of the data warehouse,

which allows the user to navigate the relationships in the underlying data and ask questions that can

drive workload and optimization demands in multiple layers, and sometimes creates conflicts between

the optimization and its workload model in the database.

For example, to determine the sales dollars per month and year for each customer in a household,

the query on a data warehouse will look as follows:

Select cust.customer_name, cust.cust_id, household.household_id, household.

state, sum(sales.net_sales_amt) net_sales, sales.month_num, sales.year_num from

cust,household,sales where cust.household_id = household.household_id and cust.cust_

id = sales.cust_id group by cust.customer_name, cust.cust_id, household.household_id,

sales.month_num, sales.year_num

Let us assume the tables sizes are:

●

Customer: 300,000 rows

●

Household: 190,000 rows

●

Sales: 75,000,000 rows

To execute this query, the data warehouse will perform several cycles of computational processing

across layers of infrastructure, as shown in Figure 6.3 :

●

There are large data sets that will be read from the disk and moved to the memory and back to the

disk. This will lead to excessive data swapping between different layers on a continued basis.

●

Additionally, the same data may be needed multiple times and depending on how stale or fresh

the data is, the query might have to repeat the processing steps since there is not enough space to

persist the result sets.

●

Depending on the configuration setting of the database and the server, the memory and CPU are

divided between the multiple threads that are generated by the query for processing. For example,

a dual-core CPU can execute four threads—two per CPU. In the case of an average server the

CPU can execute 32 threads for eight CPUs. If there are 32 threads and eight users running, the

CPU cycles will be divided for the eight users.

●

If there are multiple queries executing operations in parallel, there will be contention for

resources, as everybody will be on the same shared-everything architecture. Multiple threads of

execution can decrease throughput and increase swapping of data and slowdown of execution.

●

If the data warehouse load operations execute in parallel with the query processing, they can

evolve into sync issues especially in circular references.

●

A mix of query and load operations can cause the database to underperform in the shared

architecture as it creates a mixed workload.

●

Adding ad-hoc users and data mining type of operations can cause further underperformance.

Data Warehousing in the Age of Big Data

Search WWH ::

Custom Search

Home