need. This is ideal for creating materialized views and storing them in your RDBMS or
NoSQL database.
Although Apache Flume was originally written for processing log files, it's a
general-purpose tool and can be used on other sources of immutable big data, such as
output from data loggers or raw data from web crawling systems. As data loggers get
lower in price, tools like Apache Flume will be needed to preprocess more big data problems.
6.10 Case study: computer-aided discovery of health care fraud
In this case study, we'll take a look at a problem that can't be easily solved using a
shared-nothing architecture. This is the problem of looking for patterns of fraud
using large graphs. Highly connected graphs aren't partition tolerant, meaning that
you can't divide the queries on a graph across two or more shared-nothing processors. If
your graph is too large to fit in the RAM of a commodity processor, you may need to
look at an alternative to a shared-nothing system.
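To make the partitioning problem concrete, here is a minimal sketch that counts how many edges cross partition boundaries when a graph's vertices are split across several nodes. The function names, the range-based partitioning scheme, and the randomly generated edges are all assumptions chosen for illustration; real graph stores use their own partitioning strategies. A chain of vertices stands in for loosely connected data, and random edges stand in for a highly connected graph.

```python
import random


def range_partition(vertex, num_vertices, num_partitions):
    """Assign a vertex to a partition by contiguous id range (illustrative scheme)."""
    return vertex * num_partitions // num_vertices


def cut_fraction(edges, num_vertices, num_partitions):
    """Return the fraction of edges whose endpoints live on different partitions."""
    crossing = sum(
        1
        for u, v in edges
        if range_partition(u, num_vertices, num_partitions)
        != range_partition(v, num_vertices, num_partitions)
    )
    return crossing / len(edges)


def chain_edges(num_vertices):
    """Loosely connected data: each vertex links only to its neighbor."""
    return [(i, i + 1) for i in range(num_vertices - 1)]


def random_edges(num_vertices, num_edges, seed=0):
    """Stand-in for a highly connected graph: edges between arbitrary vertices."""
    rng = random.Random(seed)
    edges = []
    while len(edges) < num_edges:
        u, v = rng.randrange(num_vertices), rng.randrange(num_vertices)
        if u != v:
            edges.append((u, v))
    return edges


if __name__ == "__main__":
    n, k = 1000, 4
    print("chain :", cut_fraction(chain_edges(n), n, k))
    print("random:", cut_fraction(random_edges(n, 5000), n, k))
```

With four partitions, only three of the chain's 999 edges cross a boundary, while roughly three quarters of the random edges do. Every crossing edge is a hop that a graph traversal would have to make over the network, which is why adding shared-nothing nodes doesn't help much here.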
This case study is important because it explores the limits of what a cluster of
shared-nothing systems can do. We include this case study because we want to avoid a
tendency for architects to recommend large shared-nothing clusters for all problems.
Although shared-nothing architectures work for many big data problems, they don't
provide for linear scaling of highly connected data such as graphs or RDBMSs containing
joins. Looking for hidden patterns in large graphs is one area that's best solved
with a custom hardware approach.
6.10.1 What is health care fraud detection?
The US Office of Management and Budget estimates that improper
payments in Medicare and Medicaid came to $50.7 billion in 2010, nearly 8.5% of the
annual Medicare budget. A portion of this staggering figure is the result of improper
documentation, but it's certain that Medicare fraud costs taxpayers tens of billions of
dollars annually.
Existing efforts to detect fraud have focused on searching for suspicious submissions
from individual beneficiaries and health care providers. These efforts yielded
$4.1 billion in fraud recovery in 2011, around 10% of the total estimated fraud.
Unfortunately, fraud is becoming more sophisticated, and detection must move
beyond the search for individuals to the discovery of patterns of collusion among multiple
beneficiaries and/or health care providers. Identifying these patterns is challenging,
as fraudulent behaviors continuously change, requiring the analyst to hypothesize
that a pattern of relationships could indicate fraud, visualize and evaluate the results,
and iteratively refine their hypothesis.
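As a rough illustration of what one such hypothesis might look like in code, the sketch below treats claims as edges between beneficiaries and providers and reports pairs of providers that share at least a given number of beneficiaries. The pattern itself, the function name, the `min_shared` threshold, and the sample claims are all hypothetical; they aren't taken from any real fraud detection system, and a shared beneficiary is by no means evidence of fraud on its own.

```python
from collections import defaultdict
from itertools import combinations


def shared_beneficiary_pairs(claims, min_shared):
    """Return provider pairs that share at least min_shared beneficiaries.

    claims is an iterable of (beneficiary_id, provider_id) tuples.
    The result maps (provider_a, provider_b) to the number of shared beneficiaries.
    """
    # Group providers by the beneficiary they billed for.
    providers_by_beneficiary = defaultdict(set)
    for beneficiary, provider in claims:
        providers_by_beneficiary[beneficiary].add(provider)

    # For each beneficiary, every pair of their providers shares that beneficiary.
    shared = defaultdict(set)
    for beneficiary, providers in providers_by_beneficiary.items():
        for a, b in combinations(sorted(providers), 2):
            shared[(a, b)].add(beneficiary)

    return {
        pair: len(beneficiaries)
        for pair, beneficiaries in shared.items()
        if len(beneficiaries) >= min_shared
    }


if __name__ == "__main__":
    # Made-up claims for illustration only.
    claims = [
        ("b1", "pA"), ("b1", "pB"),
        ("b2", "pA"), ("b2", "pB"),
        ("b3", "pA"), ("b3", "pC"),
    ]
    print(shared_beneficiary_pairs(claims, min_shared=2))
```

An analyst could rerun the query with a different `min_shared` value, or swap in a different relationship pattern altogether, which mirrors the hypothesize, evaluate, and refine loop described above. Note that this pairwise comparison touches relationships across the whole graph, so it inherits the partitioning difficulty discussed earlier.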