Data Modeling Approaches for Big Data and Analytics Solutions - Big Data Imperatives

Databases Reference

In-Depth Information

• Ad-hoc Queries and What-If Simulations . Data analysts run a set of

queries to analyze and find answers to specific business questions,

as these questions are above and beyond the standard set of KPIs.

During the course of these ad-hoc analyses the data analysts also

perform several what-if scenarios while planning analysis.

The ad-hoc tasks have a defined life cycle. Once the data analysts are able to find

answers to new business questions, the queries are incorporated into a report so that

business users can monitor it as an ongoing practice. In a typical data warehouse, the

bulk of tasks (80 percent) are from reports. The remaining 20 percent is from ad-hoc.

Since reports are frequent and generate known queries, the design of the data

warehouse is done to cater to reporting. This includes data models, indexes, materialized

views or derived tables (and other optimizations) to make the known reporting queries

perform optimally.

Since ad-hoc tasks are infrequent and generate unknown queries, the design of the

data warehouse is unable to cater to them upfront. This means that ad-hoc tasks generate

queries that are harder to satisfy (since they are constrained by the data modeling

decisions made for reporting) and therefore impose more load on the data warehouse.

The net result is that reports run fast while ad-hoc queries are slow. In fact, ad-hoc

queries consume so much resource that reporting applications run slower, and that is

not good: reports are distributed widely and reach a wide variety of business users. They

are unhappy and put pressure on the data warehousing team to “get the reports in time.”

At the same time, the ad-hoc users are unhappy because they can't get to the data fast

enough to benefit the business.

Historically, the answer to this deadlock situation was to prioritize via workload

management and constrain ad-hoc usage to devote resources to reporting. If workload

management didn't work, the answer was to define rules such as: “reports will not be

refreshed when data is loading”; ad-hoc queries should not be run when reports are being

generated.”

Let's call this design pattern of data warehouse a reporting data warehouse .

The priority objective of a reporting data warehouse is to ensure reports are accurate and

that they perform optimally. The ad-hoc tasks are not treated with priority: they don't get

dedicated data models or large chunks of resources; their tasks were heavily monitored,

and often they are asked to curtail their requirements (use samples, use rolled up

aggregates that were built to make reports faster, use smaller timeframes of history that

were retained to just satisfy reporting requirements, phrase queries that are simpler even

though they may be compromises on the pattern sought, so on and so forth).

However, with big data platforms all these constraints on ad-hoc tasks become

meaningless. And with current technology advances we have the ability to address the

constraints and issues discussed above.

This motivates the definition of a different design pattern of a data warehouse:

a big data and analytics data warehouse . The priority objective of a big data analytics

data warehouse is to provide capabilities for ad-hoc analytics, and the primary users of

the big data and analytics data warehouse are data analysts. The platform would support

schema-less data ingestion architecture that will allow the data analysts to integrate any

data source into the system, the data models are built to support their ad-hoc usage:

fine-granularity data is retained, rich dimension tables are frequently imported, derived

views and tables are created promptly, interfaces are opened up to express their patterns

Big Data Imperatives

Search WWH ::

Custom Search

Home