Databases Reference
In-Depth Information
Ad-hoc Queries and What-If Simulations . Data analysts run a set of
queries to analyze and find answers to specific business questions,
as these questions are above and beyond the standard set of KPIs.
During the course of these ad-hoc analyses the data analysts also
perform several what-if scenarios while planning analysis.
The ad-hoc tasks have a defined life cycle. Once the data analysts are able to find
answers to new business questions, the queries are incorporated into a report so that
business users can monitor it as an ongoing practice. In a typical data warehouse, the
bulk of tasks (80 percent) are from reports. The remaining 20 percent is from ad-hoc.
Since reports are frequent and generate known queries, the design of the data
warehouse is done to cater to reporting. This includes data models, indexes, materialized
views or derived tables (and other optimizations) to make the known reporting queries
perform optimally.
Since ad-hoc tasks are infrequent and generate unknown queries, the design of the
data warehouse is unable to cater to them upfront. This means that ad-hoc tasks generate
queries that are harder to satisfy (since they are constrained by the data modeling
decisions made for reporting) and therefore impose more load on the data warehouse.
The net result is that reports run fast while ad-hoc queries are slow. In fact, ad-hoc
queries consume so much resource that reporting applications run slower, and that is
not good: reports are distributed widely and reach a wide variety of business users. They
are unhappy and put pressure on the data warehousing team to “get the reports in time.”
At the same time, the ad-hoc users are unhappy because they can't get to the data fast
enough to benefit the business.
Historically, the answer to this deadlock situation was to prioritize via workload
management and constrain ad-hoc usage to devote resources to reporting. If workload
management didn't work, the answer was to define rules such as: “reports will not be
refreshed when data is loading”; ad-hoc queries should not be run when reports are being
generated.”
Let's call this design pattern of data warehouse a reporting data warehouse .
The priority objective of a reporting data warehouse is to ensure reports are accurate and
that they perform optimally. The ad-hoc tasks are not treated with priority: they don't get
dedicated data models or large chunks of resources; their tasks were heavily monitored,
and often they are asked to curtail their requirements (use samples, use rolled up
aggregates that were built to make reports faster, use smaller timeframes of history that
were retained to just satisfy reporting requirements, phrase queries that are simpler even
though they may be compromises on the pattern sought, so on and so forth).
However, with big data platforms all these constraints on ad-hoc tasks become
meaningless. And with current technology advances we have the ability to address the
constraints and issues discussed above.
This motivates the definition of a different design pattern of a data warehouse:
a big data and analytics data warehouse . The priority objective of a big data analytics
data warehouse is to provide capabilities for ad-hoc analytics, and the primary users of
the big data and analytics data warehouse are data analysts. The platform would support
schema-less data ingestion architecture that will allow the data analysts to integrate any
data source into the system, the data models are built to support their ad-hoc usage:
fine-granularity data is retained, rich dimension tables are frequently imported, derived
views and tables are created promptly, interfaces are opened up to express their patterns
 
Search WWH ::




Custom Search