Database Reference
In-Depth Information
F Ability to generate data in the right format for custom data types
F Ability to add data to multiple tables, while respecting Foreign Key constraints
between tables
F Ability to add data in non-uniform distributions
See also
The tools and tricks shown here are cool and clever, though there are some problems hiding
here as well. Real data has so many strange things in it that it can be very hard to simulate. One
of the most difficult things is generating data that follows realistic distributions. For example, if
we had to generate data for people's heights, then we'd want to generate data to follow a normal
distribution. If we were generating customer bank balances we'd want to use a Zipf distribution,
or for number of reported insurance claims, perhaps a Poisson distribution (or perhaps not).
Replicating the real quirks in data can take some time.
You can use existing data to generate test databases using sampling. That's the subject
of our next recipe.
Randomly sampling data
DBAs may be asked to set up a test server, and populate it with test data. Often, that server
will be old hardware, possibly with smaller disk sizes. So, the subject of data sampling raises
its head.
The purpose of sampling is to reduce the size of the data set and improve the speed of later
analysis. Some statisticians are so used to the idea of sampling that they may not even
question whether its use is valid, or cause further complications.
How to do it...
First, you should realize that there isn't a simple tool to slice off a sample of your database.
It would be neat if there were, but there isn't. You'll need to read all of this to understand why.
We first need to consider some SQL to derive a sample. Random sampling is actually very
simple, because we can use the SQL function random() within the WHERE clause. For example:
postgres=# SELECT count(*) FROM mybigtable;
count
-------
10000
(1 row)
postgres=# SELECT count(*) FROM mybigtable
WHERE random() < 0.01;
count
 
Search WWH ::




Custom Search